[mvapich-discuss] MVAPICH2-GDR on AWS EC2 instance

Marius Brehler marius.brehler at tu-dortmund.de
Tue Aug 21 04:00:40 EDT 2018


Hi,

Yet another update: the issue occurs with MV2_USE_GPUDIRECT=1; with
MV2_USE_GPUDIRECT=0 the application executes as expected. I am currently
testing on a p2.xlarge with only one GPU.
The GDR userguide mentions that "For such systems [regular OFED
or GPUs without GPUDirect RDMA], MVAPICH2-GDR 2.3a efficiently takes
advantage of CUDA IPC and GDRCOPY features." Should GPUDIRECT therefore
be disabled? Based on https://developer.nvidia.com/gpudirect, I assumed
that "GPUDirect peer-to-peer transfers and memory access" between cards
would be affected by this parameter. Or does the MV2_USE_GPUDIRECT
parameter only apply to "GPUDirect support for RDMA"?
Please excuse the noise, and thanks in advance for any clarification.
Best Regards

Marius


On 08/21/2018 12:49 AM, Marius Brehler wrote:
> Hi,
> to correct my previous mail: the shutdown was initiated by the job
> script we used and was not caused by the segfault. Sorry for the
> misleading information. Anyway, I would appreciate any hints on whether
> there is anything special to keep in mind when setting up MVAPICH2-GDR
> on an EC2 instance.
> Best Regards
>
> Marius
>
> On 20 August 2018 23:18:22 CEST, Marius Brehler
> <marius.brehler at tu-dortmund.de> wrote:
>
>     Hi,
>
>     I am trying to use the CUDA-aware MPI_Ibcast within an application on an
>     AWS EC2 instance of type p2.8xlarge. This instance does not have an
>     InfiniBand HCA. If I am informed correctly, it is nevertheless
>     recommended to use the GDR version when GPUs are involved.
>
>     I noticed that libmpi.so is linked against libibmad, libibumad and
>     libibverbs. For simplicity I installed the libs together with the
>     Mellanox OFED (which might not be a good idea).
>     As described in the GDR userguide, I also installed gdrcopy, but I
>     skipped the nv_peer_mem module, since our application runs fine on a
>     local node without the nv_peer_mem module loaded. Our local node has
>     two K40c GPUs and also a Mellanox HCA installed (not involved in any
>     communication at the moment). Apart from that, the CentOS configuration
>     on the local node and on the EC2 instance is quite similar.
>
>     Unfortunately, our application crashes so badly that the EC2
>     instance instantly shuts down:
>
>     [.. .compute.internal:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
>     [.. .compute.internal:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
>     [.. .compute.internal:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
>
>     Any hints on what might have gone wrong, or on how to set up
>     MVAPICH2-GDR on an EC2 instance correctly?
>     Regards
>
>     Marius
>
>     --
>     M.Sc. Marius Brehler
>     Research Associate/Ph.D. Candidate
>
>     TU Dortmund University
>     Chair for High Frequency Technology
>     44227 Dortmund, Germany
>
>     Important note: The information included in this e-mail is confidential. It is solely intended for the recipient. If you are not the intended recipient of this e-mail please contact the sender and delete this message. Thank you. Without prejudice of e-mail correspondence, our statements are only legally binding when they are made in the conventional written form (with personal signature) or when such documents are sent by fax.
>
>     ------------------------------------------------------------------------
>
>     mvapich-discuss mailing list
>     mvapich-discuss at cse.ohio-state.edu
>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>