[mvapich-discuss] Persistent "error(22): Could not modify boot qp to RTS" - MVAPICH2 2.0(a/b) with dual Mellanox Connect-IB MT27600

Filippo Spiga spiga.filippo at gmail.com
Thu Dec 12 15:58:20 EST 2013


Dear MVAPICH developers and users,

I am struggling with a problem that I cannot trace back to its source. I am testing a new GPU cluster deployed at my institution. Before checking any GPU-aware capability (I am particularly interested in GPU Direct over RDMA), I want to run the NVIDIA GPU-aware HPL. Immediately after submission, the application crashes, printing many copies of this message:

[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS
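
When a job prints the same message once per process, it can help to collapse the output into counts to confirm that only one distinct error is present. The following is a minimal sketch; the function name `summarize_log` and the temporary sample file are illustrative assumptions, not part of MVAPICH2 or of the original job scripts.

```shell
# Collapse identical lines of a log file into "count message" pairs,
# most frequent first. summarize_log is a hypothetical helper name.
summarize_log() {
    sort "$1" | uniq -c | sort -rn
}

# Illustrative sample: three copies of the message reported above.
log=$(mktemp)
for i in 1 2 3; do
    echo '[src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could not modify boot qp to RTS'
done > "$log"

# Prints one line: the message prefixed with its count (3).
summarize_log "$log"
rm -f "$log"
```

Pointing `summarize_log` at the real job output file would show whether any other message is hidden among the repeats.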

I searched Google without much success. The node configuration is dual-socket Ivy Bridge with two Mellanox Connect-IB cards. OFED is MLNX_OFED_LINUX-2.0-3.0.0. I compiled MVAPICH2 2.0a (same for 2.0b) with these flags (GCC 4.4.7, CUDA 5.5, NVIDIA driver 331.20):

CC=gcc CFLAGS="-O3 -march=native" \
CXX=g++ CXXFLAGS="-O3 -march=native" \
F77=gfortran FFLAGS="-O3 -march=native" \
FC=gfortran FCFLAGS="-O3 -march=native" \
./configure --prefix=... \
    --with-device=ch3:mrail --with-rdma=gen2 --enable-rdma-cm \
    --enable-romio --with-file-system=lustre+nfs \
    --with-hwloc --enable-blcr --with-slurm=... \
    --enable-cuda --with-cuda-include=$CUDA_INSTALL_PATH/include --with-cuda-lib=$CUDA_INSTALL_PATH/lib64 \
    --enable-threads=default --enable-shared --enable-sharedlibs=gcc \
    --enable-cxx --enable-fc --enable-f77 \
    --enable-g=none --enable-fast --with-pm=hydra

The only environment variables I explicitly export in my submission scripts are:

export MV2_ENABLE_AFFINITY=0
export MV2_NUM_HCAS=1
export MV2_RAIL_SHARING_POLICY=ROUND_ROBIN
export MV2_PROCESS_TO_RAIL_MAPPING=SCATTER
export MV2_USE_RDMA_FAST_PATH=1

The same problem occurs if I leave the environment clear, with none of these variables set.
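
To confirm which of these settings a job actually sees (or that the environment really is clear), the submission script can print every `MV2_*` variable before launching the application. This is a minimal sketch; the function name `show_mv2_env` is a hypothetical helper, not an MVAPICH2 tool.

```shell
# List every MV2_* variable visible in the current environment, sorted.
# Prints nothing when no such variable is exported.
show_mv2_env() {
    env | grep '^MV2_' | sort
}

# Example: run in a subshell so the export does not leak into the caller.
(
    export MV2_ENABLE_AFFINITY=0
    show_mv2_env
)
```

Calling the function just before the launch line in each run makes it easy to compare the two cases side by side.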

I wonder if someone can point out my mistake or advise me on which direction to investigate in order to solve this issue.

(Less relevant, but the same code works when compiled with Intel MPI 4.1 or Open MPI 1.7.3.)

Best Regards,
Filippo

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
