[mvapich-discuss] MVAPICH2-GDR on AWS EC2 instance

Marius Brehler marius.brehler at tu-dortmund.de
Thu Aug 23 04:12:52 EDT 2018


Hi Ammar,

thanks for the clarification. Since MV2_USE_GPUDIRECT only applies to GPU
Direct RDMA, it is clear why this did not work out. It might help
others to add a parameter description to section 7 of the GDR
userguide. As mentioned, I wrongly assumed that MV2_USE_GPUDIRECT would
also affect GPUDirect peer-to-peer transfers between GPUs on the same
system.
Best Regards

Marius
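
For anyone finding this thread later: the rule from the quoted reply below
(MV2_USE_GPUDIRECT only makes sense with nv_peer_mem and an IB HCA present)
could be checked before launching a job. The following is only a minimal
sketch, not part of MVAPICH2-GDR. The function names are hypothetical, and it
assumes a Linux host where loaded modules are listed in /proc/modules and RDMA
devices appear under /sys/class/infiniband. Treating any entry in that
directory as a usable HCA is a simplification.

```python
import os
from pathlib import Path


def module_loaded(name, proc_modules="/proc/modules"):
    """Return True if a kernel module called `name` is listed in proc_modules."""
    try:
        with open(proc_modules) as fh:
            # The first whitespace-separated field of each line is the module name.
            return any(line.split()[0] == name for line in fh if line.strip())
    except OSError:
        # File missing or unreadable: assume the module is not loaded.
        return False


def has_ib_hca(sysfs_dir="/sys/class/infiniband"):
    """Return True if at least one device is registered under sysfs_dir."""
    path = Path(sysfs_dir)
    return path.is_dir() and any(path.iterdir())


def gdr_environment(module_ok, hca_ok, base_env=None):
    """Return a copy of the environment with MV2_USE_GPUDIRECT set to "1" or "0"."""
    env = dict(os.environ if base_env is None else base_env)
    # Enable GPU Direct RDMA only if both prerequisites are met.
    env["MV2_USE_GPUDIRECT"] = "1" if (module_ok and hca_ok) else "0"
    return env


if __name__ == "__main__":
    env = gdr_environment(module_loaded("nv_peer_mem"), has_ib_hca())
    print("MV2_USE_GPUDIRECT=" + env["MV2_USE_GPUDIRECT"])
```

The returned dictionary could be passed as the env argument of
subprocess.run when starting the MPI launcher. On an instance without an HCA
or without nv_peer_mem, this sketch selects MV2_USE_GPUDIRECT=0.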



On 8/22/18 7:20 PM, Ammar Ahmad Awan wrote:
> Hi Marius,
>
> MVAPICH2-GDR requires the nv_peer_mem module to be installed on a
> machine to utilize the GPU Direct RDMA (GDR) support. The
> flag MV2_USE_GPUDIRECT only applies to the GDR support.
>
> If you don't have nv_peer_mem installed on your machine, please set
> MV2_USE_GPUDIRECT=0.
>
> Regarding the EC2 instance, if you don't have an IB HCA, GPU Direct RDMA
> is not supported as that refers to the mechanism where an HCA can
> directly read from the GPU's memory.
>
> Regards,
> Ammar
>
>
>
> On Tue, Aug 21, 2018 at 4:02 AM Marius Brehler
> <marius.brehler at tu-dortmund.de <mailto:marius.brehler at tu-dortmund.de>>
> wrote:
>
>     Hi,
>
>     yet another update. The issue occurs with MV2_USE_GPUDIRECT=1; with
>     MV2_USE_GPUDIRECT=0 the application executes as expected. I am currently
>     testing on a p2.xlarge with only one GPU.
>     The GDR userguide mentions that "For such systems [regular OFED
>     or GPUs without GPUDirect RDMA], MVAPICH2-GDR 2.3a efficiently takes
>     advantage of CUDA IPC and GDRCOPY features." Should GPUDIRECT therefore
>     be disabled? Following https://developer.nvidia.com/gpudirect, I assumed
>     that "GPUDirect peer-to-peer transfers and memory access" between cards
>     would be affected by this parameter. Or does the MV2_USE_GPUDIRECT
>     parameter only apply to "GPUDirect support for RDMA"?
>     Please excuse the noise, and thanks in advance for the clarification.
>     Best Regards
>
>     Marius
>
>
>     On 08/21/2018 12:49 AM, Marius Brehler wrote:
>     > Hi,
>     > in correction to my prior mail, the shutdown was initiated by the
>     > job script in use and was not caused by the segfault. Sorry for this
>     > misleading information. Anyway, I would appreciate any hints if there
>     > is something special to keep in mind when setting up MVAPICH2-GDR on
>     > an EC2 instance.
>     > Best Regards
>     >
>     > Marius
>     >
>     > On 20 August 2018 at 23:18:22 CEST, Marius Brehler
>     > <marius.brehler at tu-dortmund.de> wrote:
>     >
>     >     Hi,
>     >
>     >     I am trying to use the CUDA-aware MPI_Ibcast within an application
>     >     on an AWS EC2 instance of type p2.8xlarge. This instance does not
>     >     have an InfiniBand HCA. If I am informed correctly, it is
>     >     nevertheless recommended to use the GDR version if GPUs are involved.
>     >
>     >     I noticed that libmpi.so is linked against libibmad, libibumad and
>     >     libibverbs. For simplicity I installed these libraries together with
>     >     the Mellanox OFED (which might not be a good idea).
>     >     As stated in the GDR userguide I also installed gdrcopy, but skipped
>     >     the nv_peer_mem module. I skipped it since our application runs fine
>     >     on a local node without loading the nv_peer_mem module. Our local
>     >     node has two K40c GPUs and also a Mellanox HCA installed (not
>     >     involved in any communication at the moment). The remaining
>     >     configuration of CentOS on the local node and the EC2 instance is
>     >     quite similar.
>     >
>     >     Unfortunately, our application is crashing so badly that the EC2
>     >     instance instantly shuts down:
>     >
>     >     [.. .compute.internal:mpi_rank_0][error_sighandler] Caught error:
>     >     Segmentation fault (signal 11)
>     >     [.. .compute.internal:mpi_rank_0][error_sighandler] Caught error:
>     >     Segmentation fault (signal 11)
>     >     [.. .compute.internal:mpi_rank_1][error_sighandler] Caught error:
>     >     Segmentation fault (signal 11)
>     >
>     >     Any hint as to what might have gone wrong, or how to set up
>     >     MVAPICH2-GDR on an EC2 instance correctly?
>     >     Regards
>     >
>     >     Marius
>     >
>     >     --
>     >     M.Sc. Marius Brehler
>     >     Research Associate/Ph.D. Candidate
>     >
>     >     TU Dortmund University
>     >     Chair for High Frequency Technology
>     >     44227 Dortmund, Germany
>     >     Important note: The information included in this e-mail is
>     >     confidential. It is solely intended for the recipient. If you are
>     >     not the intended recipient of this e-mail please contact the sender
>     >     and delete this message. Thank you. Without prejudice of e-mail
>     >     correspondence, our statements are only legally binding when they
>     >     are made in the conventional written form (with personal signature)
>     >     or when such documents are sent by fax.
>     >
>     >     ------------------------------------------------------------------------
>     >
>     >     mvapich-discuss mailing list
>     >     mvapich-discuss at cse.ohio-state.edu
>     >     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>     >
>

--
M.Sc. Marius Brehler
Research Associate

TU Dortmund University
Chair for High Frequency Technology
Friedrich-Wöhler-Weg 4
D-44227 Dortmund

Tel.: +49 231-755 6674
Fax:  +49 231-755 4631
E-Mail: marius.brehler at tu-dortmund.de


