[mvapich-discuss] help: Poll CQ failed!

Dhabaleswar Panda panda at cse.ohio-state.edu
Tue Dec 29 01:27:16 EST 2009


Hi Jeff,

Thanks for your report. This seems to be some kind of system-related/set-up
issue. Could you let us know what kind of adapters and switch you are using?
Are all the network cables connected tightly? Which OFED version is being
used? How are the Ethernet connections set up for the nodes in the cluster?
The mpirun_rsh job-startup framework uses TCP/IP initially to set up
connections.
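
As a starting point, a few standard OFED utilities can confirm the adapter,
link state, and stack version (a minimal sketch; the commands assume a
typical OFED install, and the peer hostname is just a placeholder):

  ibv_devinfo              # adapter model, firmware, and port state
  ibstat                   # per-port link state and rate
  ofed_info | head -1      # OFED release string
  ping <peer-hostname>     # TCP/IP path that mpirun_rsh uses at startup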

Also, are you configuring mvapich and mvapich2 properly, i.e., using the
OpenFabrics-Gen2 interface? The MVAPICH 1.0.1 stack is very old. Please use
the latest MVAPICH 1.1 branch version:

http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/
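For reference, the Gen2 builds are usually done along these lines (a rough
sketch; the script and configure flag names below are from memory and may
differ slightly in your tarballs):

  # MVAPICH 1.1 branch: use the gen2 build script shipped with the source
  cd mvapich-1.1 && ./make.mvapich.gen2

  # MVAPICH2: select the OpenFabrics-Gen2 RDMA channel at configure time
  ./configure --with-rdma=gen2 && make && make install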

If you have multi-core nodes (say 8/16 cores per node), first try running
mvapich1 or mvapich2 with 8/16 MPI processes on a single node. Then try
running the same number of MPI processes across multiple nodes (say 8 MPI
processes on 8 nodes, using 1 process/node). After that, run experiments
involving multiple cores and nodes together. Running these separate tests
will help you isolate the problem in your set-up and correct it; see the
example commands below.
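For example, with mpirun_rsh the progression could look like this (the
hostfile names here are just placeholders for files listing the appropriate
nodes):

  # Step 1: 8 processes on a single node
  mpirun_rsh -ssh -np 8 -hostfile ./hosts.single_node ./cpi

  # Step 2: 8 processes across 8 nodes, 1 process per node
  mpirun_rsh -ssh -np 8 -hostfile ./hosts.one_per_node ./cpi

  # Step 3: multiple cores on multiple nodes
  mpirun_rsh -ssh -np 16 -hostfile ./hosts.16 ./cpi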

Thanks,

DK


On Mon, 28 Dec 2009, Jeff Haferman wrote:

>
> I've built four mvapich 1.0.1 stacks (PGI, gnu, intel, sun) and
> one mvapich2 1.4 stack (PGI), and I'm getting the same problem with all of
> them just running the simple "cpi" test:
>
> With mvapich1:
> mpirun -np 16 -machinefile ./hostfile.16 ./cpi
> Abort signaled by rank 6: Error polling CQ
> MPI process terminated unexpectedly
> Signal 15 received.
> DONE
>
> With mvapich2:
> mpirun_rsh -ssh -np 3 -hostfile ./hostfile.16 ./cpi
> Fatal error in MPI_Init:
> Internal MPI error!, error stack:
> MPIR_Init_thread(311).........: Initialization failed
> MPID_Init(191)................: channel initialization failed
> MPIDI_CH3_Init(163)...........:
> MPIDI_CH3I_RDMA_init(190).....:
> rdma_ring_based_allgather(545): Poll CQ failed!
>
>
> The INTERESTING thing is that sometimes these run successfully!  They
> almost always run with 2-4 processors, but generally fail with more than
> 4 processors (and my hostfile is set up to ensure that the processes are
> on physically separate nodes).  Today I've actually had a hard time
> getting mvapich1 to fail with any number of processors.
>
> The ibdiagnet tests show no problems.
>
> Where do I go from here?
>
> Jeff
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


