[mvapich-discuss] Hang in CH3 SMP Rendezvous protocol w/ CUDA w/o Infiniband

Khaled Hamidouche hamidouc at cse.ohio-state.edu
Thu Jan 22 12:30:08 EST 2015


Hi Paul,

We are not able to reproduce your issue. I tried both H-H and D-D transfers
with different MV2_SMP_EAGERSIZE values (4K, 8K, 16K, ...) on a node without
an IB HCA, and all the tests passed. Could you please provide more
information about your platform/system?

Thanks



On Wed, Jan 21, 2015 at 4:38 PM, Paul Sathre <sath6220 at cs.vt.edu> wrote:

> Hello all,
>
> I am in the process of developing some GPGPU library code atop MPI, and we
> selected MVAPICH due to its demonstrated support for GPUDirect
> communication. However, in shared-memory tests on a local node that is not
> equipped with InfiniBand, we are unable to complete MPI_Send/MPI_Recv pairs
> beyond the MV2_SMP_EAGERSIZE threshold due to a hang internal to MVAPICH -
> both for device and host buffers.
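>
> For concreteness, the pattern that hangs is essentially the following (a
> minimal sketch rather than our actual library code; the 1 MB size is
> illustrative - anything above MV2_SMP_EAGERSIZE behaves the same, and
> swapping malloc for cudaMalloc gives the device-buffer case):
>
>   /* run with: mpirun -np 2 ./repro  (both ranks on the same node) */
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>
>   int main(int argc, char **argv)
>   {
>       const int count = 1 << 20;   /* 1 MB, above the SMP eager threshold */
>       int rank;
>       char *buf;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       buf = malloc(count);
>
>       if (rank == 0) {
>           MPI_Send(buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>       } else if (rank == 1) {
>           MPI_Recv(buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
>                    MPI_STATUS_IGNORE);
>           printf("received %d bytes\n", count);  /* not reached in our tests */
>       }
>
>       free(buf);
>       MPI_Finalize();
>       return 0;
>   }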
>
> I have verified this is present in both mvapich2-2.1rc1 and mvapich2-2.0,
> and confirmed the hang is not restricted to our code, as the same behavior
> is exhibited by the osu_latency benchmark - the last message size it prints
> before hanging is exactly half of MV2_SMP_EAGERSIZE. (I've tested all
> power-of-two eager sizes from 16K to 1M, with MV2_SMP_LENGTH_QUEUE fixed to
> 4x the eager size, and observed the same behavior in every case.)
>
> I have been unable to determine whether the hang occurs in the initial
> rendezvous handshake or in the actual transfer of the large buffer.
>
> My configure line is:
>  ../mvapich2-2.1rc1/configure
> --prefix=/home/psath/mvapich2-2.1rc1/build/install --enable-cuda
> --disable-mcast
>
> (This is run from ~/mvapich2-2.1rc1/build, with the source tree in
> ~/mvapich2-2.1rc1/mvapich2-2.1rc1/.) I am forced to disable multicast
> because our dev node has no InfiniBand hardware or the associated header
> files. Enabling CUDA is a requirement for us.
>
> The node is running 64-bit Ubuntu Linux with kernel 3.11.0-14-generic and
> gcc 4.6.4.
>
> We are able to continue debugging our non-GPUDirect fallback code paths
> (our own host staging of buffers) with standard MPICH in the meantime, but
> going forward we would prefer the performance afforded by sidestepping the
> host when possible.
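>
> Roughly, that fallback path looks like the following (a simplified sketch
> of the idea rather than our actual code; error checking omitted):
>
>   /* Stage device data through a host bounce buffer so that a
>    * non-CUDA-aware MPI never sees a device pointer. */
>   #include <mpi.h>
>   #include <cuda_runtime.h>
>   #include <stdlib.h>
>
>   static void staged_send(const void *d_buf, int bytes, int dest, MPI_Comm comm)
>   {
>       void *h_buf = malloc(bytes);
>       cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
>       MPI_Send(h_buf, bytes, MPI_BYTE, dest, 0, comm);
>       free(h_buf);
>   }
>
>   static void staged_recv(void *d_buf, int bytes, int src, MPI_Comm comm)
>   {
>       void *h_buf = malloc(bytes);
>       MPI_Recv(h_buf, bytes, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
>       cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
>       free(h_buf);
>   }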
>
> Please let me know if there is any other information I can provide that
> would help with diagnosing the issue.
>
> Thanks!
>
> -Paul Sathre
> Research Programmer - Synergy Lab
> Dept. of Computer Science
> Virginia Tech
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>