[mvapich-discuss] IBV_EVENT_QP_LAST_WQE_REACHED error

amith rajith mamidala mamidala at cse.ohio-state.edu
Tue Oct 2 11:42:36 EDT 2007


Hi Steve,

Can you try these checks?
1. Make sure that the IB installation is the same on both nodes.
2. Use mpirun_rsh instead of mpiexec.
3. Disable shared memory collectives by setting the environment variable
   VIADEV_USE_SHMEM_COLL=0 (see the example below).
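
For example, with mpirun_rsh the variable can be set on the command line
(the process count and hostfile name here are just placeholders):

  $ mpirun_rsh -np 8 -hostfile ./hosts VIADEV_USE_SHMEM_COLL=0 ~/NGA/bin/arts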

thanks,

-Amith.

On Mon, 1 Oct 2007, Steve Jones wrote:

> Hi.
>
> We've moved from the Topspin drivers to Cisco OFED 1.2 and OSU
> MVAPICH on one cluster and have run into an issue with one code when
> running on more than one node: it fails with an
> IBV_EVENT_QP_LAST_WQE_REACHED error. The application compiles and
> runs correctly on a single node.
>
> *Application on nodes=1:ppn=8*
>
>   [smjones at compute-3-11 Test_for_Steve]$ mpiexec ~/NGA/bin/arts
>   No input file name was detected, using "input".
>          Step        Time        CFLmax      Umax        Vmax        Wmax    Divergence
>             0   0.00000E+00      0.2920   2.190E+01   0.000E+00   0.000E+00   5.702E+04
>             1   5.00000E-06      1.0322   3.297E+01   1.984E+01   1.281E-04   5.742E-01
>             2   9.84412E-06      0.9742   3.896E+01   1.902E+01   1.366E-04   5.294E-01
>             3   1.47779E-05      0.9240   4.261E+01   1.816E+01   1.627E-04   1.824E-02
>
>
> *Application on nodes=2:ppn=4*
>
>   [smjones at compute-3-15 Test_for_Steve]$ mpiexec ~/NGA/bin/arts
>    No input file name was detected, using "input".
>           Step        Time        CFLmax      Umax        Vmax        Wmax    Divergence
>   [0:compute-3-15.local] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
>    at line 2551 in file viacheck.c
>   mpiexec: Warning: accept_abort_conn: MPI_Abort from IP 10.255.255.169, rank 0, killing all.
>   forrtl: error (78): process killed (SIGTERM)
>   forrtl: error (78): process killed (SIGTERM)
>   forrtl: error (78): process killed (SIGTERM)
>   forrtl: error (78): process killed (SIGTERM)
>   mpiexec: Warning: task 0 exited with status 255.
>
> It's been noted that the application crashes when calling MPI_COMM_SPLIT.
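>
> For reference, here is a minimal stand-in test exercising the same call
> (a sketch only, not our actual code):
>
>   /* comm_split_test.c - minimal MPI_Comm_split exercise (sketch only) */
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank, size, newrank;
>       MPI_Comm newcomm;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>       /* Split the ranks into two groups, keyed by original rank. */
>       MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);
>       MPI_Comm_rank(newcomm, &newrank);
>       printf("world rank %d of %d -> new rank %d\n", rank, size, newrank);
>
>       MPI_Comm_free(&newcomm);
>       MPI_Finalize();
>       return 0;
>   }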
>
> I noticed traffic on the list about running checks with the ib_rdma_bw
> test. I've run the test in advance and included the output below. I've
> also appended the MVAPICH build script in case it's a build-related
> issue.
>
> Thanks.
>
> Steve
>
>
>
>
>
> ib_rdma_bw test runs:
>
> [smjones at compute-3-21 ~]$ /usr/bin/ib_rdma_bw
> 25531: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |
>
> [smjones at compute-3-21 ~]$ /usr/bin/ib_rdma_bw -s1048576 -n100
> 25587: | port=18515 | ib_port=1 | size=1048576 | tx_depth=100 | iters=100 | duplex=0 | cma=0 |
> 25587: Local address:  LID 0x229, QPN 0x960405, PSN 0xe28cbd RKey 0x2002400 VAddr 0x00002a959aa000
> 25587: Remote address: LID 0x231, QPN 0x6e0405, PSN 0xd141ca, RKey 0x28002400 VAddr 0x00002a95bbb000
>
>
> [smjones at compute-3-20 ~]$ /usr/bin/ib_rdma_bw -s1048576 -n100 c3-21
> 25551: | port=18515 | ib_port=1 | size=1048576 | tx_depth=100 | iters=100 | duplex=0 | cma=0 |
> 25551: Local address:  LID 0x231, QPN 0x6e0405, PSN 0xd141ca RKey 0x28002400 VAddr 0x00002a95bbb000
> 25551: Remote address: LID 0x229, QPN 0x960405, PSN 0xe28cbd, RKey 0x2002400 VAddr 0x00002a959aa000
>
>
> 25551: Bandwidth peak (#0 to #86): 1138.18 MB/sec
> 25551: Bandwidth average: 1136.94 MB/sec
> 25551: Service Demand peak (#0 to #86): 1997 cycles/KB
> 25551: Service Demand Avg  : 1999 cycles/KB
>
>
> #!/bin/bash
>
> source ./make.mvapich.def
> arch
>
> # Mandatory variables.  All are checked except CXX and F90.
> MTHOME=/usr
> PREFIX=/share/apps/mvapich/intel
> export CC=icc
> export CXX=icc
> export F77=ifort
> export F90=ifort
> export RSHCOMMAND=ssh
> IO_BUS="_PCI_EX_"
> ARCH="_EM64T_"
> LINKS="_DDR_"
>
> export LIBS="-L${MTHOME}/lib64 -libverbs -libumad -libcommon -lpthread"
> export FFLAGS="-L${MTHOME}/lib64 -xP -fPIC"
> export CFLAGS="-D${ARCH} -D__INTEL_COMPILER -g -DCH_GEN2
> -DMEMORY_SCALE -D_AFFINITY_ \
>                 -D_SMP_ -D_SMP_RNDV_ -DVIADEV_RPUT_SUPPORT \
>                 -fPIC -DEARLY_SEND_COMPLETION -DLAZY_MEM_UNREGISTER \
>                 -D${IO_BUS} -D${LINKS} \
>                 -I${MTHOME}/include -I${MTHOME}/include/rdma \
>                 -I/opt/panfs/include"
> export CCFLAGS="-lstdc++"
> # Prologue
> make distclean &>/dev/null
>
> # Configure MVAPICH
>
> echo "Configuring MVAPICH..."
>
> ./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
>          --enable-cxx --enable-debug \
>          --enable-devdebug \
>          --enable-f77 --enable-f90 \
>          --with-romio --with-file-system=ufs+nfs+panfs \
>          --without-mpe \
>          -lib="-L${MTHOME}/lib64 -libverbs -libumad -libcommon -lpthread"
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


