[mvapich-discuss] IBV_EVENT_QP_LAST_WQE_REACHED error
Steve Jones
stevejones at stanford.edu
Mon Oct 1 18:19:23 EDT 2007
Hi.
We've moved from the Topspin drivers to Cisco-OFED 1.2 and OSU
MVAPICH on one cluster, and we've run into a problem with one code
when running on more than one node: it fails with an
IBV_EVENT_QP_LAST_WQE_REACHED error. The application compiles and
runs correctly on a single node.
*Application on nodes=1:ppn=8*
[smjones at compute-3-11 Test_for_Steve]$ mpiexec ~/NGA/bin/arts
No input file name was detected, using "input".
Step   Time          CFLmax   Umax        Vmax        Wmax        Divergence
   0   0.00000E+00   0.2920   2.190E+01   0.000E+00   0.000E+00   5.702E+04
   1   5.00000E-06   1.0322   3.297E+01   1.984E+01   1.281E-04   5.742E-01
   2   9.84412E-06   0.9742   3.896E+01   1.902E+01   1.366E-04   5.294E-01
   3   1.47779E-05   0.9240   4.261E+01   1.816E+01   1.627E-04   1.824E-02
*Application on nodes=2:ppn=4*
[smjones at compute-3-15 Test_for_Steve]$ mpiexec ~/NGA/bin/arts
No input file name was detected, using "input".
Step   Time          CFLmax   Umax        Vmax        Wmax        Divergence
[0:compute-3-15.local] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line 2551 in file viacheck.c
mpiexec: Warning: accept_abort_conn: MPI_Abort from IP 10.255.255.169, rank 0, killing all.
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
mpiexec: Warning: task 0 exited with status 255.
It's been noted that the application crashes when calling MPI_COMM_SPLIT.
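To help narrow that down, below is a minimal stand-alone test that calls
MPI_Comm_split across nodes. This is only a sketch: the split/allreduce
pattern is an assumption to exercise newly created communicators, not
code taken from the application itself.

/* split_test.c - hypothetical reproducer, not part of arts */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size, sub_rank, sum = 0;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split the ranks into two groups, just to force creation of a
       new communicator (and new connections) across the nodes. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);

    /* Generate some traffic on the new communicator. */
    MPI_Allreduce(&sub_rank, &sum, 1, MPI_INT, MPI_SUM, sub_comm);
    printf("world rank %d of %d: sub rank %d, sum %d\n",
           world_rank, world_size, sub_rank, sum);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}

If this also aborts with IBV_EVENT_QP_LAST_WQE_REACHED on nodes=2:ppn=4,
the problem is presumably in the MVAPICH/OFED stack rather than in the
application itself.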
I noticed traffic on the list about running checks with the ib_rdma_bw
test, so I've run it in advance and included the output below. I've
also appended the MVAPICH build script in case it's a build-related
issue.
Thanks.
Steve
*ib_rdma_bw test*
[smjones at compute-3-21 ~]$ /usr/bin/ib_rdma_bw
25531: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |
[smjones at compute-3-21 ~]$ /usr/bin/ib_rdma_bw -s1048576 -n100
25587: | port=18515 | ib_port=1 | size=1048576 | tx_depth=100 | iters=100 | duplex=0 | cma=0 |
25587: Local address: LID 0x229, QPN 0x960405, PSN 0xe28cbd RKey 0x2002400 VAddr 0x00002a959aa000
25587: Remote address: LID 0x231, QPN 0x6e0405, PSN 0xd141ca, RKey 0x28002400 VAddr 0x00002a95bbb000
[smjones at compute-3-20 ~]$ /usr/bin/ib_rdma_bw -s1048576 -n100 c3-21
25551: | port=18515 | ib_port=1 | size=1048576 | tx_depth=100 | iters=100 | duplex=0 | cma=0 |
25551: Local address: LID 0x231, QPN 0x6e0405, PSN 0xd141ca RKey 0x28002400 VAddr 0x00002a95bbb000
25551: Remote address: LID 0x229, QPN 0x960405, PSN 0xe28cbd, RKey 0x2002400 VAddr 0x00002a959aa000
25551: Bandwidth peak (#0 to #86): 1138.18 MB/sec
25551: Bandwidth average: 1136.94 MB/sec
25551: Service Demand peak (#0 to #86): 1997 cycles/KB
25551: Service Demand Avg : 1999 cycles/KB
#!/bin/bash
source ./make.mvapich.def
arch
# Mandatory variables. All are checked except CXX and F90.
MTHOME=/usr
PREFIX=/share/apps/mvapich/intel
export CC=icc
export CXX=icc
export F77=ifort
export F90=ifort
export RSHCOMMAND=ssh
IO_BUS="_PCI_EX_"
ARCH="_EM64T_"
LINKS="_DDR_"
export LIBS="-L${MTHOME}/lib64 -libverbs -libumad -libcommon -lpthread"
export FFLAGS="-L${MTHOME}/lib64 -xP -fPIC"
export CFLAGS="-D${ARCH} -D__INTEL_COMPILER -g -DCH_GEN2 -DMEMORY_SCALE -D_AFFINITY_ \
-D_SMP_ -D_SMP_RNDV_ -DVIADEV_RPUT_SUPPORT \
-fPIC -DEARLY_SEND_COMPLETION -DLAZY_MEM_UNREGISTER \
-D${IO_BUS} -D${LINKS} \
-I${MTHOME}/include -I${MTHOME}/include/rdma \
-I/opt/panfs/include"
export CCFLAGS="-lstdc++"
# Prelogue
make distclean &>/dev/null
# Configure MVAPICH
echo "Configuring MVAPICH..."
./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
--enable-cxx --enable-debug \
--enable-devdebug \
--enable-f77 --enable-f90 \
--with-romio --with-file-system=ufs+nfs+panfs \
--without-mpe \
-lib="-L${MTHOME}/lib64 -libverbs -libumad -libcommon -lpthread"