[mvapich-discuss] MPI_{Send, Recv} Cuda buffer not actually synchronous?

Steven Eliuk s.eliuk at samsung.com
Mon Nov 17 12:49:21 EST 2014


Sure, I have someone preparing a small test program.

Here is a question for you, this is strange…

If we have GDR enabled and run on a single node, with one master and two slave processes, we can reproduce the issue. However, no IB fabric should be in use, since we are on a single node and the IPC peer route should be taken. If we disable GDR, i.e. MV2_USE_GPUDIRECT=0, then our test passes and we see no early posting of a sync recv.

This doesn’t make much sense. Can you provide some insight?
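
For reference, the two single-node runs being compared might be launched roughly as follows. This is only a sketch: the mpirun_rsh launcher, the host name node0, the binary name ./gdr_test, and the MV2_USE_CUDA=1 setting are assumptions for illustration, not details from this thread. Only MV2_USE_GPUDIRECT is named above.

```shell
# Hypothetical setup: three ranks (one master, two slaves), all on the same node.
# node0 and ./gdr_test are placeholder names.

# Case 1: GPUDirect RDMA enabled (the issue reproduces, per the report above)
mpirun_rsh -np 3 node0 node0 node0 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 ./gdr_test

# Case 2: GPUDirect RDMA disabled (the test passes, per the report above)
mpirun_rsh -np 3 node0 node0 node0 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=0 ./gdr_test
```

Keeping everything else identical between the two runs isolates the GDR setting as the only variable.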

Kindest Regards,
—
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV,
Samsung Electronics,
1732 North First Street,
San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.


From: Akshay Venkatesh <akshay.v.3.14 at gmail.com>
Date: Saturday, November 15, 2014 at 11:39 AM
To: Steven Eliuk - SISA <s.eliuk at samsung.com>
Cc: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MPI_{Send, Recv} Cuda buffer not actually synchronous?


Hi Steven,

Would it be possible to share a reproducer so that we can check locally whether there's a bug? A simple code snippet will suffice too.

Thanks

On Nov 14, 2014 11:08 PM, "Steven Eliuk" <s.eliuk at samsung.com> wrote:
Hi all,

We have noticed some strange behavior with an MPI_{Send, Recv} pair where the master sends data located in a host buffer to a slave’s GPUDirect buffer. Initially we believed it occurred only in a distributed multi-node setup, but we have since narrowed it down to a very simple case where everything resides on one node, e.g. a master with two slaves.

Do you have a more detailed change log from 2.0b-gdr -> 2.0? 2.0 seems to fix the most basic test in which we can reproduce this, but we have more complicated tests that show the same behavior. We are hoping to track it down; it seems as though you are posting, a little early, that the sync recv has completed when in fact it hasn’t.

Kindest Regards,
—
Steven Eliuk


