[mvapich-discuss] Hang in CH3 SMP Rendezvous protocol w/ CUDA w/o Infiniband

Paul Sathre sath6220 at cs.vt.edu
Wed Jan 21 16:38:31 EST 2015


Hello all,

I am in the process of developing some GPGPU library code atop MPI, and we
selected MVAPICH due to its demonstrated support for GPUDirect communication.
However, in shared-memory tests on a local node that is not equipped with
InfiniBand, we are unable to complete MPI_Send/MPI_Recv pairs with message
sizes beyond the MV2_SMP_EAGERSIZE threshold: they hang inside MVAPICH, for
both device and host buffers.
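To illustrate the pattern that hangs, here is a minimal sketch (illustrative
only; the 1 MiB message size and the 64 KiB eager setting in the comment are
placeholders, not our exact test case):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch of the hanging pattern, host buffers shown; the
   device-buffer case is identical except the pointer comes from cudaMalloc.
   Run on one node, e.g.: MV2_SMP_EAGERSIZE=65536 mpirun -np 2 ./repro
   Any message size above MV2_SMP_EAGERSIZE hangs for us. */
int main(int argc, char **argv)
{
    int rank;
    int nbytes = 1 << 20;   /* 1 MiB, well above a 64 KiB eager threshold */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(nbytes);

    if (rank == 0)
        MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);   /* never returns ... */
    else if (rank == 1)
        MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* ... or this one */

    printf("rank %d done\n", rank);
    free(buf);
    MPI_Finalize();
    return 0;
}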

I have verified that the hang is present in both mvapich2-2.1rc1 and
mvapich2-2.0, and confirmed that it is not restricted to our code: the
osu_latency benchmark exhibits the same behavior, and the last message size it
prints before hanging is exactly half of MV2_SMP_EAGERSIZE. (I've tested all
power-of-two values of MV2_SMP_EAGERSIZE from 16K to 1M, with
MV2_SMPI_LENGTH_QUEUE fixed at 4x the eager size, and observed the same
behavior at each setting.)

I have been unable to diagnose whether the hang is in the initial
rendezvous handshake, or the actual transfer of the large buffer.

My configure line is:
  ../mvapich2-2.1rc1/configure \
      --prefix=/home/psath/mvapich2-2.1rc1/build/install \
      --enable-cuda \
      --disable-mcast

(This was run from ~/mvapich2-2.1rc1/build, with the source tree in
~/mvapich2-2.1rc1/mvapich2-2.1rc1/.) I have to disable multicast because our
dev node has neither InfiniBand hardware nor the associated header files.
Enabling CUDA is a requirement for us.

The node is running 64-bit Ubuntu Linux with kernel 3.11.0-14-generic and gcc 4.6.4.

In the meantime we are able to continue debugging our non-GPUDirect fallback
code paths (our own host-staging of buffers) with standard MPICH, but going
forward we would prefer the performance afforded by sidestepping the host when
possible.
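For context, that fallback amounts to staging the device buffer through host
memory before handing it to MPI, roughly as follows (a simplified sketch with
illustrative names, not our actual library code; the CUDA-aware path would
instead pass the device pointer straight to MPI_Send):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Host-staged send of a device buffer: copy to a temporary host buffer,
   then issue an ordinary host-memory MPI_Send. */
static void send_device_buffer(const void *d_buf, int nbytes, int dest,
                               int tag, MPI_Comm comm)
{
    void *h_buf = malloc(nbytes);
    cudaMemcpy(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, nbytes, MPI_BYTE, dest, tag, comm);
    free(h_buf);
}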

Please let me know if there is any other information I can provide that
would help with diagnosing the issue.

Thanks!

-Paul Sathre
Research Programmer - Synergy Lab
Dept. of Computer Science
Virginia Tech