[mvapich-discuss] IBV_EVENT_QP_LAST_WQE_REACHED error

Steve Jones stevejones at stanford.edu
Mon Oct 1 18:19:23 EDT 2007


We've moved one cluster from the Topspin drivers to Cisco-OFED 1.2 and
OSU MVAPICH, and have run into a problem with one code: when run on more
than one node, it fails with an IBV_EVENT_QP_LAST_WQE_REACHED error. The
application compiles and runs correctly on a single node.

*Application on nodes=1:ppn=8*

  [smjones@compute-3-11 Test_for_Steve]$ mpiexec ~/NGA/bin/arts
  No input file name was detected, using "input".
         Step        Time        CFLmax      Umax        Vmax        Wmax     Divergence
            0   0.00000E+00      0.2920   2.190E+01   0.000E+00   0.000E+00   5.702E+04
            1   5.00000E-06      1.0322   3.297E+01   1.984E+01   1.281E-04   5.742E-01
            2   9.84412E-06      0.9742   3.896E+01   1.902E+01   1.366E-04   5.294E-01
            3   1.47779E-05      0.9240   4.261E+01   1.816E+01   1.627E-04   1.824E-02

*Application on nodes=2:ppn=4*

  [smjones@compute-3-15 Test_for_Steve]$ mpiexec ~/NGA/bin/arts
  No input file name was detected, using "input".
         Step        Time        CFLmax      Umax        Vmax        Wmax     Divergence
  [0:compute-3-15.local] Abort: [0] Got FATAL event  
   at line 2551 in file viacheck.c
  mpiexec: Warning: accept_abort_conn: MPI_Abort from IP, rank 0, killing all.
  forrtl: error (78): process killed (SIGTERM)
  forrtl: error (78): process killed (SIGTERM)
  forrtl: error (78): process killed (SIGTERM)
  forrtl: error (78): process killed (SIGTERM)
  mpiexec: Warning: task 0 exited with status 255.

It's been noted that the application crashes when calling MPI_COMM_SPLIT.

I saw the earlier traffic on the list suggesting checks with ib_rdma_bw,
so I've already run that test and have included the output below. I've
also appended the MVAPICH build script in case this is a build-related
issue.



*ib_rdma_bw test*

[smjones@compute-3-21 ~]$ /usr/bin/ib_rdma_bw
25531: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |

[smjones@compute-3-21 ~]$ /usr/bin/ib_rdma_bw -s1048576 -n100
25587: | port=18515 | ib_port=1 | size=1048576 | tx_depth=100 | iters=100 | duplex=0 | cma=0 |
25587: Local address:  LID 0x229, QPN 0x960405, PSN 0xe28cbd RKey 0x2002400 VAddr 0x00002a959aa000
25587: Remote address: LID 0x231, QPN 0x6e0405, PSN 0xd141ca, RKey 0x28002400 VAddr 0x00002a95bbb000

[smjones@compute-3-20 ~]$ /usr/bin/ib_rdma_bw -s1048576 -n100 c3-21
25551: | port=18515 | ib_port=1 | size=1048576 | tx_depth=100 | iters=100 | duplex=0 | cma=0 |
25551: Local address:  LID 0x231, QPN 0x6e0405, PSN 0xd141ca RKey 0x28002400 VAddr 0x00002a95bbb000
25551: Remote address: LID 0x229, QPN 0x960405, PSN 0xe28cbd, RKey 0x2002400 VAddr 0x00002a959aa000

25551: Bandwidth peak (#0 to #86): 1138.18 MB/sec
25551: Bandwidth average: 1136.94 MB/sec
25551: Service Demand peak (#0 to #86): 1997 cycles/KB
25551: Service Demand Avg  : 1999 cycles/KB
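
To repeat this check across more node pairs, the average figure could be pulled out of the client output with something like the sketch below. bw_average is a hypothetical helper name, not part of the OFED tools, and it assumes the "Bandwidth average:" line keeps the format shown above.

```shell
# Hypothetical helper: read ib_rdma_bw client output on stdin and print
# the number from the "Bandwidth average:" line (in MB/sec).
# Assumes the value is the second-to-last field, as in the output above.
bw_average() {
    awk '/Bandwidth average:/ { print $(NF-1) }'
}

# Example (hostname taken from the run above):
# /usr/bin/ib_rdma_bw -s1048576 -n100 c3-21 | bw_average
```

A pair whose average falls well below the others would be a candidate for a closer look at cabling or port state before suspecting the MPI layer.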


source ./make.mvapich.def

# Mandatory variables.  All are checked except CXX and F90.
export CC=icc
export CXX=icc
export F77=ifort
export F90=ifort
export RSHCOMMAND=ssh

export LIBS="-L${MTHOME}/lib64 -libverbs -libumad -libcommon -lpthread"
export FFLAGS="-L${MTHOME}/lib64 -xP -fPIC"
                -D_SMP_ -D_SMP_RNDV_ -DVIADEV_RPUT_SUPPORT \
                -D${IO_BUS} -D${LINKS} \
                -I${MTHOME}/include -I${MTHOME}/include/rdma \
export CCFLAGS="-lstdc++"
# Prologue
make distclean &>/dev/null

# Configure MVAPICH

echo "Configuring MVAPICH..."

./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
         --enable-cxx --enable-debug \
         --enable-devdebug \
         --enable-f77 --enable-f90 \
         --with-romio --with-file-system=ufs+nfs+panfs \
         --without-mpe \
         -lib="-L${MTHOME}/lib64 -libverbs -libumad -libcommon -lpthread"
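
The script above expands MTHOME, PREFIX, IO_BUS and LINKS, none of which are set in the portion pasted here; presumably they come from make.mvapich.def. To rule out an empty value reaching the compiler or configure flags, a guard along these lines could be run before ./configure. This is only a sketch: require_vars is a hypothetical helper, not part of the MVAPICH build scripts.

```shell
# Hypothetical helper (not part of MVAPICH): return non-zero if any of
# the named variables is unset or empty, reporting each one on stderr.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is unset or empty" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage before ./configure (variable names taken from the script above):
# require_vars MTHOME PREFIX IO_BUS LINKS || exit 1
```

An empty IO_BUS or LINKS would turn -D${IO_BUS} -D${LINKS} into bare -D options, so checking them up front seems cheaper than debugging the resulting build.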
