[mvapich-discuss] MPI_Alltoallv hangs in MVAPICH-GDR

VADAMBACHERI MANIAN, KARTHIK vadambacherimanian.1 at osu.edu
Wed Mar 20 16:46:05 EDT 2019


Hi Augustin,

Thanks for reporting the MPI_Alltoallv issue to us. We tried to reproduce it locally on different clusters and unfortunately could not. To debug this further, could we get remote access to the clusters where you see this hang? Kindly let us know.

Thanks,
Karthik 

-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of mvapich-discuss-request at cse.ohio-state.edu
Sent: Monday, March 18, 2019 12:14 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: mvapich-discuss Digest, Vol 159, Issue 19


Today's Topics:

   1. Re: MPI_Alltoallv hangs in MVAPICH-GDR (AUGUSTIN DEGOMME)


----------------------------------------------------------------------

Message: 1
Date: Mon, 18 Mar 2019 17:14:19 +0100
From: AUGUSTIN DEGOMME <augustin.degomme at univ-grenoble-alpes.fr>
To: "Subramoni, Hari" <subramoni.1 at osu.edu>
Cc: "mvapich-discuss at cse.ohio-state.edu"
	<mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MPI_Alltoallv hangs in MVAPICH-GDR
Message-ID:
	<821056912.637564.1552925659492.JavaMail.zimbra at univ-grenoble-alpes.fr>
Content-Type: text/plain; charset="utf-8"

Hi, 

I added the INSTALLPATH bit for the report, to hide the details of the local machine path. It was indeed replaced by the path used to unpack the library (and it is indeed important, to avoid using another installation by mistake). In my setup the correct libmpi was being used.
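
(For reference, one way to double-check which libmpi gets picked up at run time is to look at /proc/self/maps from inside an MPI program. Below is a minimal sketch of such a check, assuming a Linux system and the mpicc from the unpacked install; the file name is just an example.)

/* check_libmpi.c -- print which libmpi.so this process actually mapped.
 * Minimal sketch, assuming Linux (/proc/self/maps) and an MPI compiler wrapper.
 * Build/run: mpicc check_libmpi.c -o check_libmpi && mpirun -np 1 ./check_libmpi */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    FILE *maps = fopen("/proc/self/maps", "r");
    char line[4096];
    while (maps && fgets(line, sizeof line, maps)) {
        if (strstr(line, "libmpi"))   /* show every mapping whose path contains libmpi */
            fputs(line, stdout);
    }
    if (maps)
        fclose(maps);

    MPI_Finalize();
    return 0;
}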


I've quickly built a Docker image which should show the issue (recipe attached). If it is not run with nvidia-docker, it uses the NVIDIA stubs so that non-GPU codes can still run, hence the warning at the start. I didn't try with CUDA activated as I don't have a setup running correctly today.


sudo docker pull degomme/mvapich_bug 
sudo docker run --rm -ti degomme/mvapich_bug bash 


mpirun -np 2 /opt/mvapich2/gdr/2.3.1/mcast/no-openacc/cuda10.0/mofed4.5/mpirun/gnu4.8.5/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall 
[ee38f87dbb7a:mpi_rank_0][MPIDI_CH3_Init] MVAPICH2 has been built with support for CUDA. But, MV2_USE_CUDA not set to 1. This can lead to errors in using GPU buffers. If you are running applications that use GPU buffers, please set MV2_USE_CUDA=1 and try again. 
[ee38f87dbb7a:mpi_rank_0][MPIDI_CH3_Init] To suppress this warning, please set MV2_SUPPRESS_CUDA_USAGE_WARNING to 1 

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
WARNING: 

You should always run with libnvidia-ml.so that is installed with your 
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. 
libnvidia-ml.so in GDK package is a stub library that is attached only for 
build purposes (e.g. machine that you build your application doesn't have 
to have Display Driver installed). 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 

# OSU MPI All-to-All Personalized Exchange Latency Test v5.6.1 
# Size Avg Latency(us) 
1 1.45 
2 1.34 
4 1.41 
... 

so alltoall works fine

mpirun -np 2 /opt/mvapich2/gdr/2.3.1/mcast/no-openacc/cuda10.0/mofed4.5/mpirun/gnu4.8.5/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoallv 
[ee38f87dbb7a:mpi_rank_0][MPIDI_CH3_Init] MVAPICH2 has been built with support for CUDA. But, MV2_USE_CUDA not set to 1. This can lead to errors in using GPU buffers. If you are running applications that use GPU buffers, please set MV2_USE_CUDA=1 and try again. 
[ee38f87dbb7a:mpi_rank_0][MPIDI_CH3_Init] To suppress this warning, please set MV2_SUPPRESS_CUDA_USAGE_WARNING to 1 

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
WARNING: 

You should always run with libnvidia-ml.so that is installed with your 
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. 
libnvidia-ml.so in GDK package is a stub library that is attached only for 
build purposes (e.g. machine that you build your application doesn't have 
to have Display Driver installed). 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 

# OSU MPI All-to-Allv Personalized Exchange Latency Test v5.6.1 
# Size Avg Latency(us) 
^C[mpiexec at ee38f87dbb7a] Sending Ctrl-C to processes as requested 

alltoallv hangs (on the two systems I tried to run this image on). I also tried with MOFED 3.4 instead of 4.5, with the same result.

Best regards, 

Augustin 





De: "Subramoni, Hari" <subramoni.1 at osu.edu> 
?: "AUGUSTIN DEGOMME" <augustin.degomme at univ-grenoble-alpes.fr>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu> 
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu> 
Envoy?: Lundi 18 Mars 2019 15:16:28 
Objet: RE: MPI_Alltoallv hangs in MVAPICH-GDR 



Hi, Augustin. 



I just tested this on our local cluster and both intra-node and inter-node executions of osu_alltoallv works fine for several processes. 



I have a very dumb question. In the set of commands pasted below, I see the following export. 



export LD_LIBRARY_PATH=INSTALLPATH/opt/mvapich2/gdr/2.3.1/mcast/no-openacc/cuda10.0/mofed4.5/mpirun/gnu4.8.5/lib64/:$LD_LIBRARY_PATH 



Under the assumption that these were the exact commands you used, could the issue be that the variable ?INSTALLPATH? has not been defined or that a ?$? was missed before ?INSTALLPATH?? 



Thx, 

Hari. 




From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of AUGUSTIN DEGOMME 
Sent: Monday, March 18, 2019 5:04 AM 
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu> 
Subject: [mvapich-discuss] MPI_Alltoallv hangs in MVAPICH-GDR

Hello,

With CUDA disabled (the GPU path is untested), MVAPICH2-GDR hangs forever on MPI_Alltoallv calls. This was reproduced with our code, but also with the OSU benchmarks shipped in the MVAPICH installation folder. Alltoall and all the other collectives work fine; only alltoallv hangs.

This was tested with both the released 2.3 and 2.3.1 versions of MVAPICH2-GDR, both on a Debian testing system and in an Ubuntu 18.04 Docker image. I was not able to reproduce it when building mvapich2 from source, only with the GDR packages.
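
The pattern that hangs is just a plain MPI_Alltoallv on host buffers. A minimal standalone sketch of that pattern (illustrative only, not our actual application code; the file name and the one-byte-per-peer counts are made up to mirror the first OSU message size) would be:

/* alltoallv_min.c -- minimal host-buffer MPI_Alltoallv, a sketch of the pattern that hangs.
 * Build/run: mpicc alltoallv_min.c -o alltoallv_min && mpirun -np 2 ./alltoallv_min */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sendbuf = calloc(size, 1);          /* one byte destined for each rank */
    char *recvbuf = calloc(size, 1);
    int  *counts  = malloc(size * sizeof(int));
    int  *displs  = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        counts[i] = 1;
        displs[i] = i;
    }

    MPI_Alltoallv(sendbuf, counts, displs, MPI_CHAR,
                  recvbuf, counts, displs, MPI_CHAR, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_Alltoallv completed\n");
    MPI_Finalize();
    return 0;
}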





Steps to reproduce:

I installed http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.1/mofed4.5/mvapich2-gdr-mcast.cuda10.0.mofed4.5.gnu4.8.5-2.3.1-1.el7.x86_64.rpm on a Debian system with rpm2cpio:

rpm2cpio mvapich2-gdr-mcast.cuda10.0.mofed4.5.gnu4.8.5-2.3.1-1.el7.x86_64.rpm.1 | cpio -id

export LD_LIBRARY_PATH=INSTALLPATH/opt/mvapich2/gdr/2.3.1/mcast/no-openacc/cuda10.0/mofed4.5/mpirun/gnu4.8.5/lib64/:$LD_LIBRARY_PATH
export MV2_USE_CUDA=0
export MV2_USE_GDRCOPY=0

./opt/mvapich2/gdr/2.3.1/mcast/no-openacc/cuda10.0/mofed4.5/mpirun/gnu4.8.5/bin/mpirun -np 2 opt/mvapich2/gdr/2.3.1/mcast/no-openacc/cuda10.0/mofed4.5/mpirun/gnu4.8.5/libexec/osu-micro-benchmarks/mpi/collectives/osu_alltoallv

[xxx:mpi_rank_0][MPIDI_CH3_Init] MVAPICH2 has been built with support for CUDA. But, MV2_USE_CUDA not set to 1. This can lead to errors in using GPU buffers. If you are running applications that use GPU buffers, please set MV2_USE_CUDA=1 and try again. 
[xxx:mpi_rank_0][MPIDI_CH3_Init] To suppress this warning, please set MV2_SUPPRESS_CUDA_USAGE_WARNING to 1 

# OSU MPI All-to-Allv Personalized Exchange Latency Test v5.6.1 
# Size Avg Latency(us) 



... and nothing.

Every single other osu_* test in the folder works as expected (including ialltoallv). strace reports a massive number of sched_yield() = 0 calls.
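
(Side note: since osu_ialltoallv does complete, one thing that could be tried as a temporary workaround, though I have not verified it in our application, is replacing the blocking call with its non-blocking counterpart followed by a wait. In the sketch above, that would mean swapping the MPI_Alltoallv call for:)

/* Untested workaround idea: same arguments as MPI_Alltoallv, plus a request handle. */
MPI_Request req;
MPI_Ialltoallv(sendbuf, counts, displs, MPI_CHAR,
               recvbuf, counts, displs, MPI_CHAR, MPI_COMM_WORLD, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);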





Best regards,

Augustin Degomme
CEA/IRIG

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Dockerfile_mvapich
Type: application/octet-stream
Size: 5188 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20190318/9ff9491f/attachment.obj>
