[mvapich-discuss] help: Poll CQ failed!

Jeff Haferman jeff.haferman at gmail.com
Tue Dec 29 11:49:46 EST 2009


Hi DK,

All of our hardware is from Sun (and, to my knowledge, is manufactured by
Mellanox):
Switch: part #X2821A-Z, 36-port QDR switch
HCAs: part #X4216A-Z, dual-port DDR PCI-E IB HCA

The network cables are definitely connected tightly; we've double-checked this.
The OFED version is 1.3.1.
The Ethernet connections have their own separate NICs and are on the 172
subnet, while the IB interfaces are on the 10 subnet. We've been running other
MPI stacks over Ethernet for a year and do most of our work over the Ethernet
interfaces, so I feel pretty good about that side. We've also been running
Lustre over IB, and it seems to be working.
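
In case it helps, these are the kinds of checks we can run on each node to
rule out a link problem (ib0 stands in for the IPoIB interface name, and
node01/node02 for two arbitrary nodes):

  # each HCA port should show State: Active and Rate: 20 (4x DDR)
  ibstat

  # the IPoIB interface should be up with a 10.x address
  ifconfig ib0

  # verbs-level ping-pong between two nodes, independent of MPI
  node01$ ibv_rc_pingpong
  node02$ ibv_rc_pingpong node01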

My configure line for MVAPICH2 looks like:
./configure --with-rdma=gen2 --with-arch=LINUX -prefix=${PREFIX} \
       --enable-cxx --enable-debug \
       --enable-devdebug \
       --enable-f77 --enable-f90 \
       --enable-romio \
       --with-file-system=lustre+nfs \
       --with-link=DDR

and I have also set the following in my environment:
export CC=pgcc
export CXX=pgCC
export F77=pgf90
export F90=pgf90
export RSHCOMMAND=ssh
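
As a quick sanity check after installing, the compiler wrappers should show
that the PGI compilers and the intended install tree were picked up, for
example:

  ${PREFIX}/bin/mpicc -show     # should invoke pgcc
  ${PREFIX}/bin/mpif90 -show    # should invoke pgf90
  which mpirun_rsh              # should resolve to ${PREFIX}/bin

(The -show option just prints the underlying compile/link command without
running it.)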


Any ideas?  I will try MVAPICH 1.1.1 later today, but perhaps you see
something obvious in my configuration.
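
I will also work through the incremental tests you suggested (all processes
on one node first, then one process per node, then combinations of cores and
nodes) along these lines; hostfile.1node and hostfile.8nodes are just
hostfiles I would create for the purpose (assuming 8-core nodes):

  # 8 processes on a single 8-core node
  mpirun_rsh -ssh -np 8 -hostfile ./hostfile.1node ./cpi

  # 8 processes, one per node, across 8 nodes
  mpirun_rsh -ssh -np 8 -hostfile ./hostfile.8nodes ./cpi

  # then mix cores and nodes, e.g. 16 processes cycled over the same 8 nodes
  mpirun_rsh -ssh -np 16 -hostfile ./hostfile.8nodes ./cpi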

Jeff


On Mon, Dec 28, 2009 at 10:27 PM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:

> Hi Jeff,
>
> Thanks for your report. This seems to be some kind of
> systems-related/set-up issue. Could you let us know what kind of adapters
> and switch you are using? Are all the network cables connected tightly?
> Which OFED version is being used? How are your Ethernet connections set up
> for the nodes in the cluster? The mpirun_rsh job-startup framework uses
> TCP/IP initially to set up the connections.
>
> Also, are you configuring MVAPICH and MVAPICH2 properly, i.e., using the
> OpenFabrics-Gen2 interface? The MVAPICH 1.0.1 stack is very old. Please use
> the latest MVAPICH 1.1 branch version:
>
> http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/
>
> If you have multi-core nodes (say 8/16 cores per node), you can try
> running MVAPICH1 or MVAPICH2 with 8/16 MPI processes on a given node
> first. Then you can try running the same number of MPI processes across
> multiple nodes (say 8 MPI processes on 8 nodes, with 1 process per node).
> Then you can run experiments involving multiple cores and nodes. Running
> such separate tests will help you isolate the problem in your set-up and
> correct it.
>
> Thanks,
>
> DK
>
>
> On Mon, 28 Dec 2009, Jeff Haferman wrote:
>
> >
> > I've built four MVAPICH 1.0.1 stacks (PGI, GNU, Intel, Sun) and
> > one MVAPICH2 1.4 stack (PGI), and I'm getting the same problem with all
> > of them when just running the simple "cpi" test:
> >
> > With mvapich1:
> > mpirun -np 16 -machinefile ./hostfile.16 ./cpi
> > Abort signaled by rank 6: Error polling CQ
> > MPI process terminated unexpectedly
> > Signal 15 received.
> > DONE
> >
> > With mvapich2:
> > mpirun_rsh -ssh -np 3 -hostfile ./hostfile.16 ./cpi
> > Fatal error in MPI_Init:
> > Internal MPI error!, error stack:
> > MPIR_Init_thread(311).........: Initialization failed
> > MPID_Init(191)................: channel initialization failed
> > MPIDI_CH3_Init(163)...........:
> > MPIDI_CH3I_RDMA_init(190).....:
> > rdma_ring_based_allgather(545): Poll CQ failed!
> >
> >
> > The INTERESTING thing is that sometimes these run successfully!  They
> > almost always run with 2-4 processes, but generally fail with more than
> > 4 processes (and my hostfile is set up to ensure that the processes land
> > on physically separate nodes).  Today I've actually had a hard time
> > getting MVAPICH1 to fail with any number of processes.
> >
> > The ibdiagnet tests show no problems.
> >
> > Where do I go from here?
> >
> > Jeff
> >