[mvapich-discuss] help: Poll CQ failed!

Jeff Haferman jeff.haferman at gmail.com
Thu Dec 31 01:56:24 EST 2009


DK -
Thanks, this is helpful.

I tried ib_rdma_lat between 2 nodes and received:
Conflicting CPU frequency values detected: 2336.000000 != 2003.000000
Latency typical: inf usec
Latency best   : inf usec
Latency worst  : inf usec
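
As I understand it, that "Conflicting CPU frequency values" warning means the
perftest timing is based on CPU cycle counters while cpufreq scaling is moving
the clock around, which would explain the "inf" numbers independently of any
IB problem.  Assuming the standard Linux cpufreq sysfs layout on our nodes, I
plan to pin the governor on both machines (as root) and re-run, roughly like:

  # check the current governor on every core
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  # pin all cores to the performance governor for the duration of the test
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done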

I tried rping between 2 nodes. On the client side I ran
rping -S 100 -d -v -c -a compute-ib-1-0
and received:
verbose
client
created cm_id 0x14490c70
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x14490c70 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x14490c70 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x144933d0
created channel 0x144933f0
created cq 0x14493410
created qp 0x14493550
rping_setup_buffers called on cb 0x1448e010
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x14490c70 (parent)
ESTABLISHED
rmda_connect successful
RDMA addr 14493a90 rkey e002801 len 100
send completion
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x14490c70 (parent)
client DISCONNECT EVENT...
wait for RDMA_WRITE_ADV state 6
poll error -2
rping_free_buffers called on cb 0x1448e010
destroy cm_id 0x14490c70
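
(For completeness: the server end is the stock rping from librdmacm running on
compute-ib-1-0.  Assuming the usual rping options, the pairing looks roughly
like this, with <server-ib-addr> standing in for that node's IPoIB address:

  # on compute-ib-1-0: plain verbose server
  rping -s -v -a <server-ib-addr>

  # on the client: bounded run of 10 pings with 100-byte buffers
  rping -c -v -d -C 10 -S 100 -a compute-ib-1-0

The poll error -2 appears right after the DISCONNECTED event in the client
trace above.)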

I did a bit of searching and found a similar message on openfabrics.org
suggesting that we might need to upgrade the firmware on our switch, so I
will look into that.  Any other ideas would be appreciated.
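
Assuming the usual OFED diagnostic utilities are installed, I'll record the
HCA firmware level on each node first, with something like:

  # firmware version as seen by the verbs layer
  ibv_devinfo | grep fw_ver

  # or ibstat for the firmware version plus per-port state

The switch firmware on the X2821A-Z itself I'll have to check through Sun's
own management tools.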

Jeff

On Wed, Dec 30, 2009 at 8:48 PM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:

> Jeff - This clearly seems to be a hardware/set-up issue. You can try
> running verbs-level tests (rdma_latency, etc.) across different nodes to
> make sure that the IB fabric is stable. You may also contact the hardware
> vendor for additional information.
>
> Thanks,
>
> DK
>
> On Wed, 30 Dec 2009, Jeff Haferman wrote:
>
> >
> > Well, we updated to OFED 1.4.1-4, and tried the mvapich (1.1.0) and
> > openmpi (1.2.8) supplied with it.  I'm still seeing the same problems.
> >
> > With mvapich I do
> > mpirun -np 16 -machinefile ./hostfile.ib ./cpi
> > which SOMETIMES bombs with
> > Abort signaled by rank 12: Exit code -3 signaled from compute-ib-1-1
> > Killing remote processes...[compute-1-1.local:12] Got error polling CQ
> >
> > With openmpi I do
> > mpirun --mca btl openib,self -np 16 --hostfile hostfile.orte ./cpi
> > which SOMETIMES bombs with
> >
> > [compute-ib-1-0][0,1,1][btl_openib_component.c:1357:btl_openib_component_progress]
> > error polling HP CQ with -2 errno says Success
> >
> > These machines have Ethernet interfaces and I can run mpich / openmpi
> > fine over them.  Sometimes the IB runs work; the IB runs always work if
> > I run on a single node (each has 8 cores).  If I run between 2 IB nodes
> > it usually works but sometimes bombs.  With more than 2 nodes it usually
> > does not work.  I noticed that my first run of the day usually works,
> > which almost suggests that something is not being cleaned up, but "ipcs"
> > shows everything to be clean.
> >
> > As I said, with the new OFED I used the stacks it provides.  I'll try
> > compiling my own tomorrow, but any ideas would be appreciated.
> >
> > This is new IB hardware from Sun; the ibdiagnet tests don't show any
> > problems, but I don't quite know what to do from here.
> >
> > Jeff
> >
> >
> > Dhabaleswar Panda wrote:
> > > Let us know what you observe with the mvapich 1.1 branch version.
> > >
> > > I also notice that you are using OFED 1.3.1. This is an older version.
> > > Since you are using a QDR switch (with DDR adapters), you may try to
> > > update your system to the latest OFED 1.4.* version. The GA release of
> > > OFED 1.5 will also be coming out soon and you can use it.
> > >
> > > DK
> > >
> > > On Tue, 29 Dec 2009, Jeff Haferman wrote:
> > >
> > >> Hi DP
> > >>
> > >> All of our hardware is Sun (and to my knowledge is manufactured by
> > >> Mellanox):
> > >> Switch:  part #X2821A-Z  36-port QDR switch
> > >> HCAs: part #X4216A-Z dual-port DDR PCI-E IB HCA
> > >>
> > >> The network cables are definitely connected tightly; we've
> > >> double-checked this.
> > >> The OFED version is 1.3.1.
> > >> The Ethernet connections have their own separate NICs; they are on the
> > >> 172 subnet, and the IB interfaces are on the 10 subnet. We've been
> > >> running other MPI stacks over Ethernet for a year and do most of our
> > >> work over the Ethernet interfaces, so I feel pretty good about that.
> > >> We've been running Lustre over IB, and it seems to be working.
> > >>
> > >> My configure line for MVAPICH2 looks like:
> > >> ./configure --with-rdma=gen2 --with-arch=LINUX -prefix=${PREFIX} \
> > >>        --enable-cxx --enable-debug \
> > >>        --enable-devdebug \
> > >>        --enable-f77 --enable-f90 \
> > >>        --enable-romio \
> > >>        --with-file-system=lustre+nfs \
> > >>        --with-link=DDR
> > >>
> > >> and I also have set the following in my environment:
> > >> export CC=pgcc
> > >> export CXX=pgCC
> > >> export F77=pgf90
> > >> export F90=pgf90
> > >> export RSHCOMMAND=ssh
> > >>
> > >>
> > >> Any ideas?  I will try MVAPICH 1.1.1 later today but perhaps you see
> > >> something obvious in my configuration.
> > >>
> > >> Jeff
> > >>
> > >>
> > >> On Mon, Dec 28, 2009 at 10:27 PM, Dhabaleswar Panda <panda at cse.ohio-state.edu> wrote:
> > >>
> > >> > Hi Jeff,
> > >> >
> > >> > Thanks for your report. This seems to be some kind of
> > >> > systems-related/set-up issue. Could you let us know what kind of
> > >> > adapters and switch you are using? Are all the network cables
> > >> > connected tightly? Which OFED version is being used? How are your
> > >> > Ethernet connections set up for the nodes in the cluster? The
> > >> > mpirun_rsh job-startup framework uses TCP/IP initially to set up
> > >> > connections.
> > >> >
> > >> > Also, are you configuring mvapich and mvapich2 properly, i.e., using
> > >> > the OpenFabrics-Gen2 interface? The MVAPICH 1.0.1 stack is very old.
> > >> > Please use the latest MVAPICH 1.1 branch version.
> > >> >
> > >> > http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/
> > >> >
> > >> > If you have multi-core nodes (say 8/16 cores per node), you can try
> > >> > running mvapich1 or mvapich2 with 8/16 MPI processes on a given node
> > >> > first. Then you can try running the same number of MPI processes
> > >> > across multiple nodes (say using 8 MPI processes on 8 nodes using 1
> > >> > process/node). Then you can run experiments involving multiple cores
> > >> > and nodes. Running such separate tests will help you to isolate the
> > >> > problem on your set-up and correct it.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > DK
> > >> >
> > >> >
> > >> > On Mon, 28 Dec 2009, Jeff Haferman wrote:
> > >> >
> > >> > >
> > >> > > I've built four mvapich 1.0.1 stacks (PGI, gnu, intel, sun) and
> > >> > > one mvapich2 1.4 stack (PGI), and I'm getting the same problem with
> > >> > > all of them just running the simple "cpi" test:
> > >> > >
> > >> > > With mvapich1:
> > >> > > mpirun -np 16 -machinefile ./hostfile.16 ./cpi
> > >> > > Abort signaled by rank 6: Error polling CQ
> > >> > > MPI process terminated unexpectedly
> > >> > > Signal 15 received.
> > >> > > DONE
> > >> > >
> > >> > > With mvapich2:
> > >> > > mpirun_rsh -ssh -np 3 -hostfile ./hostfile.16 ./cpi
> > >> > > Fatal error in MPI_Init:
> > >> > > Internal MPI error!, error stack:
> > >> > > MPIR_Init_thread(311).........: Initialization failed
> > >> > > MPID_Init(191)................: channel initialization failed
> > >> > > MPIDI_CH3_Init(163)...........:
> > >> > > MPIDI_CH3I_RDMA_init(190).....:
> > >> > > rdma_ring_based_allgather(545): Poll CQ failed!
> > >> > >
> > >> > >
> > >> > > The INTERESTING thing is that sometimes these run successfully!
> > >> > > They almost always run with 2-4 processors, but generally fail with
> > >> > > more than 4 processors (and my hostfile is set up to ensure that
> > >> > > the processors are on physically separate nodes).  Today I've
> > >> > > actually had a hard time getting mvapich1 to fail with any number
> > >> > > of processors.
> > >> > >
> > >> > > The ibdiagnet tests show no problems.
> > >> > >
> > >> > > Where do I go from here?
> > >> > >
> > >> > > Jeff
> > >> > >
> > >> >
> > >> >
> > >>
> > >
> >
>
>