[mvapich-discuss] Persistent "error(22): Could not modify boot qp to RTS" - MVAPICH2 2.0(a/b) with dual Mellanox Connect-IB MT27600

sreeram potluri potluri.2 at osu.edu
Fri Dec 13 10:27:57 EST 2013


Hi Filippo,

This was a known issue in MVAPICH2 2.0a with Connect-IB adapters, but it
should have been fixed in 2.0b. You can work around the error in 2.0a by
exporting MV2_ON_DEMAND_THRESHOLD=1. Can you please double-check whether the
error you are seeing with 2.0b is the same one?
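
As a minimal sketch of the workaround above, assuming a POSIX shell job script and a launcher that passes the exported environment on to the MPI ranks (Hydra's mpiexec does this by default, as far as I know), the variable can be set and checked before the launch line. The rank count and binary name in the final comment are placeholders, not taken from the original report. The mpiname check is optional and is skipped when that utility is not on PATH.

```shell
#!/bin/sh
# Workaround suggested in this thread for MVAPICH2 2.0a on Connect-IB.
export MV2_ON_DEMAND_THRESHOLD=1

# Confirm the variable is exported, so child processes (the MPI ranks) inherit it.
env | grep '^MV2_ON_DEMAND_THRESHOLD='

# Optional: with both 2.0a and 2.0b installed, check which build is first on PATH.
# mpiname ships with MVAPICH2; this step is skipped if it is not installed here.
if command -v mpiname >/dev/null 2>&1; then
    mpiname -a
fi

# Then launch as usual from the same shell. Placeholder launch line:
#   mpiexec -n 16 ./xhpl
```

When retesting 2.0b, it is probably worth running once without the export as well, since the point of that test is to see whether the fix in 2.0b holds on its own.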

Regards
Sreeram Potluri


On Thu, Dec 12, 2013 at 3:58 PM, Filippo Spiga <spiga.filippo at gmail.com> wrote:

> Dear MVAPICH developers and users,
>
> I am struggling with a problem that I cannot trace back to its source. I
> am testing a new GPU cluster deployed at my institution. Before checking
> any GPU-aware capability (I am particularly interested in GPUDirect
> RDMA), I want to run the NVIDIA GPU-aware HPL. Immediately after submission,
> the application crashes, printing many copies of this message:
>
> [src/mpid/ch3/channels/mrail/src/gen2/ring_startup.c:334] error(22): Could
> not modify boot qp to RTS
> [the same message repeated eight more times]
>
> I searched Google without much success. Each node is a dual-socket
> Ivy Bridge system with two Mellanox Connect-IB cards. OFED
> is MLNX_OFED_LINUX-2.0-3.0.0. I compiled MVAPICH2 2.0a (same for 2.0b) with
> these flags (GCC 4.4.7, CUDA 5.5, NVIDIA driver 331.20):
>
> CC=gcc CFLAGS="-O3 -march=native" CXX=g++ CXXFLAGS="-O3 -march=native"
> F77=gfortran FFLAGS="-O3 -march=native" FC=gfortran FCFLAGS="-O3
> -march=native" ./configure --prefix=... --with-device=ch3:mrail
> --with-rdma=gen2 --enable-rdma-cm --enable-romio
> --with-file-system=lustre+nfs --with-hwloc --enable-blcr --with-slurm=...
> --enable-cuda --with-cuda-include=$CUDA_INSTALL_PATH/include
> --with-cuda-lib=$CUDA_INSTALL_PATH/lib64 --enable-threads=default
> --enable-shared --enable-sharedlibs=gcc --enable-cxx --enable-fc
> --enable-f77 --enable-g=none --enable-fast --with-pm=hydra
>
> The only environment variables I explicitly exported in my submission
> scripts are
>
> export MV2_ENABLE_AFFINITY=0
> export MV2_NUM_HCAS=1
> export MV2_RAIL_SHARING_POLICY=ROUND_ROBIN
> export MV2_PROCESS_TO_RAIL_MAPPING=SCATTER
> export MV2_USE_RDMA_FAST_PATH=1
>
> The problem is the same if I export none of these variables.
>
> I wonder if someone can point out my mistake or advise me on
> which direction to investigate in order to solve this issue.
>
> (Less relevant, but the same code compiled with Intel MPI 4.1 or Open MPI
> 1.7.3 works.)
>
> Best Regards,
> Filippo
>
> --
> Mr. Filippo SPIGA, M.Sc.
> http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
>
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert