[Mvapich-discuss] Cannot build Horovod with mvapich2-gdr

You, Zhi-Qiang zyou at osc.edu
Wed Jul 21 11:12:47 EDT 2021


Thank you for the reply. I have another question regarding Horovod.

I can now build Horovod with mvapich2-gdr at OSC, but I encounter an error when I run the PyTorch synthetic benchmark with 2 GPUs on a single node:

[p0235.ten.osc.edu:mpi_rank_1][MPIDI_CH3I_MRAIL_Rndv_transfer_device_ipc] src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:412: cudaEventRecord failed: Invalid argument (22)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
[p0235.ten.osc.edu:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)
srun: error: p0235: task 1: Aborted (core dumped)

The commands I tried:

LD_PRELOAD=$OSC_MVAPICH2_LIB/libmpi.so srun -n 2 --gpu_cmode=shared --cpu-bind=none  python pytorch_synthetic_benchmark.py --batch-size=64 --model=resnet50

or

LD_PRELOAD=$OSC_MVAPICH2_LIB/libmpi.so srun -n 2 --gpu_cmode=shared --cpu-bind=none $MPICH_HOME/libexec/osu-micro-benchmarks/get_local_rank  python pytorch_synthetic_benchmark.py --batch-size=64 --model=resnet50
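
For context, the benchmark selects devices with the usual Horovod PyTorch pattern, where each rank pins itself to its node-local GPU. A minimal sketch of that setup (based on the standard Horovod example, reproduced from memory rather than verbatim):

import horovod.torch as hvd
import torch

# With "srun -n 2" on one node, each task becomes one Horovod rank.
hvd.init()

# Pin each rank to its node-local GPU (ranks 0 and 1 -> GPUs 0 and 1).
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

print(f"rank {hvd.rank()}/{hvd.size()} uses local GPU {hvd.local_rank()}")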

The MV2 variables used:

MV2_HOMOGENEOUS_CLUSTER=1
MV2_CPU_BINDING_POLICY=hybrid
MV2_USE_GDRCOPY=1
MV2_USE_CUDA=1
MV2_IBA_HCA=mlx5_0:mlx5_2
MV2_GPUDIRECT_GDRCOPY_LIB=/usr/lib64/libgdrapi.so
MV2_USE_RDMA_CM=0


The packages installed:
horovod                 0.22.1
torch                   1.9.0
torchvision             0.10.0


-ZQ

From: Anthony, Quentin G. <anthony.301 at buckeyemail.osu.edu>
Date: Saturday, July 17, 2021 at 11:01 AM
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>, You, Zhi-Qiang <zyou at osc.edu>
Subject: Re: [Mvapich-discuss] Cannot build Horovod with mvapich2-gdr
Hey Zhi-Qiang,

Currently, we recommend manually updating the prefix, exec_prefix, sysconfdir, includedir, and libdir paths in mpicc and mpicxx.
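
For illustration, the edits we mean look roughly like this (a sketch only; the install prefix below is a placeholder for wherever rpm2cpio unpacked the RPM, and the lib directory may be lib or lib64 depending on the build):

# Placeholder: wherever the mvapich2-gdr RPM was unpacked.
MV2_PREFIX=/path/to/mvapich2-gdr/install

# Rewrite the relocatable path variables at the top of the compiler wrappers.
sed -i \
    -e "s|^prefix=.*|prefix=${MV2_PREFIX}|" \
    -e "s|^exec_prefix=.*|exec_prefix=${MV2_PREFIX}|" \
    -e "s|^sysconfdir=.*|sysconfdir=${MV2_PREFIX}/etc|" \
    -e "s|^includedir=.*|includedir=${MV2_PREFIX}/include|" \
    -e "s|^libdir=.*|libdir=${MV2_PREFIX}/lib64|" \
    ${MV2_PREFIX}/bin/mpicc ${MV2_PREFIX}/bin/mpicxx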

As long as your loaded CUDA module version matches that of the RPM, the CUDA prefix paths should be correct for your system. If that's not the case, let us know and we can generate a new RPM for you with the correct CUDA version and prefix.

Thanks,
-Quentin
________________________________
From: Mvapich-discuss <mvapich-discuss-bounces+anthony.301=osu.edu at lists.osu.edu> on behalf of You, Zhi-Qiang via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Friday, July 16, 2021 1:31 PM
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Subject: Re: [Mvapich-discuss] Cannot build Horovod with mvapich2-gdr


Hello, any update?



-ZQ



From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> on behalf of You, Zhi-Qiang via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Date: Saturday, July 10, 2021 at 3:13 AM
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Subject: [Mvapich-discuss] Cannot build Horovod with mvapich2-gdr

Hello,



I followed the user guide[1] to build Horovod with mvapich2-gdr at OSC and got this CMake error:



$ module reset
$ module load mvapich2-gdr/2.3.5 cmake
$ HOROVOD_GPU_OPERATIONS=MPI HOROVOD_CUDA_HOME=$CUDA_HOME HOROVOD_WITH_MPI=1 pip install --no-cache-dir --ignore-installed horovod

[ .. skipped .. ]

-- The CXX compiler identification is GNU 8.4.0
-- Check for working CXX compiler: /apps/gnu/8.4.0/bin/c++
-- Check for working CXX compiler: /apps/gnu/8.4.0/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build architecture flags: -mf16c -mavx -mfma
-- Using command /fs/ess/scratch/PZS0710/zyou/tmp/horovod/bin/python3
CMake Error in /tmp/pip-install-xt_qt24z/horovod_6386bb2d1fc94c1f9518143ed589a550/build/temp.linux-x86_64-3.6/RelWithDebInfo/CMakeFiles/CMakeTmp/CMakeLists.txt:

  Imported target "MPI::MPI_CXX" includes non-existent path

    "/usr/local/cuda-11.0/include"

  in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

[ .. skipped .. ]



I noticed "/usr/local/cuda-11.0/include" in the flags of the MPI wrappers:



$ grep final.*flags= `which mpicxx`
final_cxxflags=" -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic "
final_cppflags=" -I/usr/local/cuda-11.0/include  -I/usr/local/cuda-11.0/include"
final_ldflags=" -L/usr/local/lib -lcuda -L/usr/local/cuda-11.0/lib64/stubs -L/usr/local/cuda-11.0/lib64 -lcudart -lrt -lstdc++ -Wl,-rpath,/usr/local/cuda-11.0/lib64 -Wl,-rpath,XORIGIN/placeholder -L/usr/local/software/slurm/current/lib64/  -fPIC -m64 "
    final_ldflags="${final_ldflags} -L/usr/local/cuda-11.0/lib64/ -L/usr/local/lib -lcuda -L/usr/local/cuda-11.0/lib64/stubs -L/usr/local/cuda-11.0/lib64 -lcudart -lrt -lstdc++ -Wl,-rpath,/usr/local/cuda-11.0/lib64 -Wl,-rpath,XORIGIN/placeholder -L/usr/local/software/slurm/current/lib64/  -fPIC -m64 -L/usr/local/software/slurm/current/lib"



At OSC, we use the rpm2cpio method to install mvapich2-gdr. I know that installing the RPM with --prefix can fix part of the flags, but there seems to be no way to update the CUDA path automatically. Besides manually modifying the flags in the wrappers, is there another way for me to proceed with building Horovod?
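
(For completeness, the manual workaround I am trying to avoid would look roughly like the following: rewriting the hard-coded CUDA 11.0 prefix seen in the grep output above so it points at the loaded CUDA module. $CUDA_HOME and $MPICH_HOME are set by our modules; this is only a sketch.)

# Manual workaround (what I'd like to avoid): point the wrappers'
# hard-coded CUDA 11.0 prefix at the CUDA module that is actually loaded.
for w in "$MPICH_HOME"/bin/mpicc "$MPICH_HOME"/bin/mpicxx; do
    sed -i "s|/usr/local/cuda-11.0|${CUDA_HOME}|g" "$w"
done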



-ZQ

