[mvapich-discuss] MPI_Ibcast issue with MVAPICH2-GDR 2.3a

Marius Brehler marius.brehler at tu-dortmund.de
Tue Sep 18 03:27:14 EDT 2018


Hi,

Thanks for the suggestions.

Running the application with MV2_CUDA_ENABLE_IPC_CACHE=0, it freezes
after some time (nvidia-smi reports 0% GPU utilization). With
MV2_CUDA_IPC=0 it passes, but becomes extremely slow.

I wrote a small reproducer that shows the same behavior; to whom should
I send it? I also noticed that it is not the first MPI_Ibcast call that
triggers the issue; the failure only appears after several calls.
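
For reference, a minimal reproducer along these lines might look roughly
like the sketch below. This is only an illustration, not the actual code
mentioned above; the buffer size, iteration count, and per-rank GPU
binding are assumptions based on the numbers discussed in this thread.

/* Sketch of a minimal CUDA-aware MPI_Ibcast reproducer (illustrative only).
 * Assumes a CUDA-aware MPI build and one GPU per rank. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Bind each rank to its own GPU (assumes one rank per GPU per node). */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    /* 30 x 2^21 doubles, i.e. ~480 MiB, matching the problem size below. */
    const size_t count = 30u * (1u << 21);
    double *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, count * sizeof(double)) != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaMemset(d_buf, 0, count * sizeof(double));

    /* The failure reportedly shows up only after several MPI_Ibcast calls,
     * hence the outer loop; the iteration count is arbitrary. */
    for (int iter = 0; iter < 20; ++iter) {
        for (int root = 0; root < size; ++root) {
            MPI_Request req;
            MPI_Ibcast(d_buf, (int)count, MPI_DOUBLE, root,
                       MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        if (rank == 0)
            printf("iteration %d done\n", iter);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

Launched with one rank per GPU against a CUDA-aware MVAPICH2-GDR build,
this repeatedly broadcasts a large device buffer from each rank in turn,
so a failure that only appears after several calls would be exposed.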

Regards


Marius


On 9/13/18 4:27 PM, Chu, Ching-Hsiang wrote:
> Hi, Marius,
>
>
> May I suggest the following:
>
>
>  1. You could try setting "MV2_CUDA_ENABLE_IPC_CACHE=0" or
>     "MV2_CUDA_IPC=0" to see if your application passes.
>  2. Could you share your application or a small reproducer, so that our
>     team can investigate further?
>
>
> Thanks,
>
> Ching-Hsiang Chu
>
> ------------------------------------------------------------------------
> *From:* mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on
> behalf of Marius Brehler <marius.brehler at tu-dortmund.de>
> *Sent:* Saturday, September 8, 2018 7:33 AM
> *To:* mvapich-discuss at cse.ohio-state.edu
> *Subject:* [mvapich-discuss] MPI_Ibcast issue with MVAPICH2-GDR 2.3a
>
> Hi,
> I am currently facing the following error while using CUDA-aware
> MPI_Ibcast on an EC2 p2.8xlarge instance:
>
> [ip-172-31-43-122.us-east-2.compute.internal:mpi_rank_0][cudaipc_register]
> src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_ipc.c:283:
> cudaIpcOpenMemHandle failed: No such file or directory (2)
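
(For context: the cudaIpcOpenMemHandle call in the trace above is part of
CUDA's inter-process communication (IPC) mechanism, which is used for
intra-node GPU-to-GPU transfers. The sketch below shows that mechanism in
plain CUDA/MPI terms, independent of MVAPICH2 internals; the two-rank
handle exchange is illustrative only.)

/* Sketch of the CUDA IPC pattern behind the failing call: one rank exports
 * a device allocation as an IPC handle, a peer rank on the same node opens
 * it. Run with at least 2 ranks on a single node, one GPU per rank. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaSetDevice(rank);  /* assumes one GPU per rank on one node */

    cudaIpcMemHandle_t handle;
    if (rank == 0) {
        void *d_src = NULL;
        cudaMalloc(&d_src, 1 << 20);
        /* Export the allocation so another process can map it. */
        cudaIpcGetMemHandle(&handle, d_src);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        void *d_mapped = NULL;
        /* This is the call reported as failing in the trace above. */
        cudaError_t err = cudaIpcOpenMemHandle(&d_mapped, handle,
                                               cudaIpcMemLazyEnablePeerAccess);
        if (err != cudaSuccess)
            fprintf(stderr, "cudaIpcOpenMemHandle: %s\n",
                    cudaGetErrorString(err));
        else
            cudaIpcCloseMemHandle(d_mapped);
    }

    /* Keep rank 0 alive until the peer is done with the mapping. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Disabling this path with MV2_CUDA_IPC=0 would explain why the error
disappears, at the cost of the slowdown described at the top of this
message.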
>
> The application involves 8 GPUs and needs to send extremely large
> messages: in the simulation, each GPU has to share 30x2^(21) real-valued
> elements, i.e. 480 MiB at 8 bytes per element. At this size the
> application fails with the error message above. Halving the problem size,
> so that each GPU only needs to share 240 MiB, the algorithm passes.
> However, processing the halved problem on 4 GPUs, so that each GPU again
> needs to share 480 MiB, the algorithm also passes.
>
> Since the AWS instance has no IB HCA, I set MV2_USE_GPUDIRECT=0.
> Toggling MV2_CUDA_USE_IPC_BCAS has no influence on the issue. Since I
> compared different implementations that rely on different communication
> patterns, I am quite sure the problem is linked to MPI_Ibcast. The
> MVAPICH2-GDR version used is 2.3a, built with GNU 4.8.5 (w/o SLURM) for
> MLNX-OFED 4.3 and CUDA 9.2. Any idea what may have gone wrong?
> Regards
>
>
> Marius
>
> --
> M.Sc. Marius Brehler
> Research Associate/Ph.D. Candidate
>
> TU Dortmund University
> Chair for High Frequency Technology
> 44227 Dortmund, Germany
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


