[mvapich-discuss] MPI_{Send, Recv} Cuda buffer not actually synchronous?

Steven Eliuk s.eliuk at samsung.com
Wed Nov 26 13:01:47 EST 2014


The master can reside on a node with no GPU, and it typically does on our distributed runs. As I mentioned previously, the pointer on the master is a 'host memory pointer', whereas on the slaves it is typically a 'device memory pointer'.
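
To make that mixed host/device pattern concrete, here is a minimal sketch (buffer names, sizes, and tags are purely illustrative, not taken from our framework): the master passes a plain host pointer to MPI, while the slave passes a cudaMalloc'd device pointer and relies on the CUDA-aware library to handle it.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int count = 292;  // illustrative payload size; run with >= 2 ranks
  if (rank == 0) {
    // Master: may run on a node with no GPU, so the buffer is host memory.
    unsigned char *host_buf = new unsigned char[count]();
    MPI_Send(host_buf, count, MPI_UNSIGNED_CHAR, 1, 2, MPI_COMM_WORLD);
    delete[] host_buf;
  } else if (rank == 1) {
    // Slave: the buffer is device memory; a CUDA-aware MPI accepts it directly.
    unsigned char *dev_buf = 0;
    cudaMalloc((void **)&dev_buf, count);
    MPI_Recv(dev_buf, count, MPI_UNSIGNED_CHAR, 0, 2, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    cudaFree(dev_buf);
  }

  MPI_Finalize();
  return 0;
}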

Please let me know if I can help further. That code was reconstructed automatically from our logging class, so it is not very readable; apologies for that. We tried to give a fairly simple example.

Kindest Regards,
—
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV,
Samsung Electronics,
1732 North First Street,
San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.


From: Akshay Venkatesh <akshay at cse.ohio-state.edu>
Date: Wednesday, November 26, 2014 at 9:55 AM
To: Steven Eliuk - SISA <s.eliuk at samsung.com>
Cc: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MPI_{Send, Recv} Cuda buffer not actually synchronous?

Hi Steven,

I need a small clarification on the master code. The following code snippet in the rank 0 section indicates that the master is to send to rank 1 and rank 2 from a GPU buffer. However, the buffers are declared in CPU memory. Is this intended? (In fact, there are no cudaMalloc calls in the master code, whereas they do exist in the slave code.)

  // GPU -> CPU
  MPI_Request *req49 = new MPI_Request[6];
  unsigned *async_buf49 = new unsigned[1];
  unsigned *async_buf52 = new unsigned[1];
  unsigned *async_buf55 = new unsigned[1];
  unsigned *async_buf58 = new unsigned[1];
  unsigned *async_buf61 = new unsigned[1];
  unsigned *async_buf64 = new unsigned[1];

  MPI_Request *req50 = new MPI_Request[6];
  long long *async_buf50 = new long long[1];
  long long *async_buf53 = new long long[1];
  long long *async_buf56 = new long long[1];
  long long *async_buf59 = new long long[1];
  long long *async_buf62 = new long long[1];
  long long *async_buf65 = new long long[1];

  MPI_Request *req51 = new MPI_Request[6];
  unsigned char *async_buf51 = new unsigned char[292];
  unsigned char *async_buf54 = new unsigned char[292];
  unsigned char *async_buf57 = new unsigned char[292];
  unsigned char *async_buf60 = new unsigned char[292];
  unsigned char *async_buf63 = new unsigned char[292];
  unsigned char *async_buf66 = new unsigned char[216];

  MPI_Isend(async_buf49, 1, MPI_UNSIGNED, 1, 32768, MPI_COMM_WORLD, &req49[0]);
  MPI_Isend(async_buf50, 1, MPI_LONG_LONG, 1, 1, MPI_COMM_WORLD, &req50[0]);
  MPI_Irecv(async_buf51, 292, MPI_UNSIGNED_CHAR, 1, 2, MPI_COMM_WORLD, &req51[0]);
  MPI_Isend(async_buf52, 1, MPI_UNSIGNED, 2, 32768, MPI_COMM_WORLD, &req49[1]);
  MPI_Isend(async_buf53, 1, MPI_LONG_LONG, 2, 1, MPI_COMM_WORLD, &req50[1]);
  MPI_Irecv(async_buf54, 292, MPI_UNSIGNED_CHAR, 2, 2, MPI_COMM_WORLD, &req51[1]);


Let me know. Thanks

On Wed, Nov 26, 2014 at 12:10 PM, Steven Eliuk <s.eliuk at samsung.com> wrote:
Is anyone willing to look at this? It seems like a major issue; we have reverted to copying data off the GPU into host memory and sending/receiving through that host buffer.
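
For reference, here is a minimal sketch of that staging workaround (the helper names and the unsigned char staging buffers are illustrative, not our actual framework code): copy device data to a host buffer, exchange host buffers with plain MPI_Send/MPI_Recv, and copy back to the device on the receiving side.

#include <mpi.h>
#include <cuda_runtime.h>

// Sender side: stage the device buffer through host memory before sending.
void send_from_device(const void *dev_src, int bytes, int dest, int tag) {
  unsigned char *stage = new unsigned char[bytes];
  cudaMemcpy(stage, dev_src, bytes, cudaMemcpyDeviceToHost);
  MPI_Send(stage, bytes, MPI_UNSIGNED_CHAR, dest, tag, MPI_COMM_WORLD);
  delete[] stage;
}

// Receiver side: receive into host memory, then copy up to the device.
void recv_to_device(void *dev_dst, int bytes, int src, int tag) {
  unsigned char *stage = new unsigned char[bytes];
  MPI_Recv(stage, bytes, MPI_UNSIGNED_CHAR, src, tag, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
  cudaMemcpy(dev_dst, stage, bytes, cudaMemcpyHostToDevice);
  delete[] stage;
}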

Kindest Regards,
—
Steven Eliuk


From: Steven Eliuk - SISA <s.eliuk at samsung.com>
Date: Wednesday, November 19, 2014 at 10:18 AM

To: Akshay Venkatesh <akshay.v.3.14 at gmail.com>
Cc: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MPI_{Send, Recv} Cuda buffer not actually synchronous?

Hello all,

Here is an example program that was generated automatically from our log files, so it is not very readable; we cannot share our large framework / library at this time. You will notice in the 60.txt file that, part way through the second run, the index resets when it should not. If you diff the two files, 60.txt and 65.txt, you can see where.

The example program reproduces the error with the 2.0b-GDR version and CUDA 6.0, but it has trouble reproducing it with the newest 2.0-GDR and CUDA 6.5. However, I can assure you that the problem is real with 2.0-GDR and CUDA 6.5; the tests that reproduce it there are just very complex.

Some details:
- It seems as though a synchronous recv (host mem -> GPU mem) is being reported as complete earlier than it actually is; this occurs more frequently than the case below (a minimal check along these lines is sketched after this list).
- The same thing happens for sync sends (GPU mem -> host mem); this is very rare, but it does happen.
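
To illustrate the first case, here is a minimal sketch of the kind of check that trips on the slave side (the payload layout and names are hypothetical; in our tests the first element is a monotonically increasing index that should never go backwards):

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

// Blocking recv into a device buffer, then read the first word back and
// verify that the sequence index advanced. If MPI_Recv is reported complete
// before the data has actually landed in GPU memory, this check can see a
// stale (reset) index.
void checked_recv(unsigned *dev_buf, int count, int master_rank, int tag,
                  unsigned *last_index) {
  MPI_Recv(dev_buf, count, MPI_UNSIGNED, master_rank, tag, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);

  unsigned index = 0;
  cudaMemcpy(&index, dev_buf, sizeof(unsigned), cudaMemcpyDeviceToHost);

  if (index <= *last_index)
    fprintf(stderr, "stale data after MPI_Recv: index %u <= last %u\n",
            index, *last_index);
  *last_index = index;
}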

Ironically, this never happens with async calls; we have tested them thoroughly. Likewise, OpenMPI works perfectly with both CUDA 6.0 and CUDA 6.5, so I doubt the driver (340.32) or the CUDA libraries are the problem.

This is using three processes, a master and two slaves, on the same machine with an NVIDIA K40. It also happens when running in distributed mode.

Kindest Regards,
—
Steven Eliuk


From: Steven Eliuk - SISA <s.eliuk at samsung.com>
Date: Monday, November 17, 2014 at 11:01 AM
To: Akshay Venkatesh <akshay.v.3.14 at gmail.com>
Cc: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MPI_{Send, Recv} Cuda buffer not actually synchronous?

I've included two simple nvprof traces, one with GDR and one without… the strange thing is that the GDR run shows no async calls, whereas the run with GDR disabled does. The code paths should be identical because everything resides on the same machine; this is not a distributed run.

Any explanation for this?

We should shortly have code prepared that reproduces the issue.

Kindest Regards,
—
Steven Eliuk


From: Steven Eliuk - SISA <s.eliuk at samsung.com>
Date: Monday, November 17, 2014 at 9:49 AM
To: Akshay Venkatesh <akshay.v.3.14 at gmail.com>
Cc: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MPI_{Send, Recv} Cuda buffer not actually synchronous?

Sure, I have someone preparing a small test program.

Here is a question for you; this is strange…

If we have GDR enabled and run on a single node, with one master and two slave processes, we can reproduce the issue. However, no IB fabric should be in use… obviously, since we are on a single node, the CUDA IPC peer route should be taken. If we disable GDR, i.e. MV2_USE_GPUDIRECT=0, then our test passes and we see no early completion of the sync recv.

This doesn’t make much sense; can you provide some insight?

Kindest Regards,
—
Steven Eliuk


From: Akshay Venkatesh <akshay.v.3.14 at gmail.com>
Date: Saturday, November 15, 2014 at 11:39 AM
To: Steven Eliuk - SISA <s.eliuk at samsung.com>
Cc: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MPI_{Send, Recv} Cuda buffer not actually synchronous?


Hi Steven,

Would it be possible to share a reproducer so that we can check whether there is a bug locally? A simple code snippet would suffice too.

Thanks

On Nov 14, 2014 11:08 PM, "Steven Eliuk" <s.eliuk at samsung.com> wrote:
Hi all,

We have noticed some strange behavior with an MPI_{Send, Recv} pair where the master sends data located in a host buffer to a slave's GPU-direct buffer. Initially we believed it occurred only in distributed multi-node runs, but we have since narrowed it down to a very simple case where everything resides on one node, e.g. a master with two slaves.

Do you have a more detailed changelog from 2.0b-GDR to 2.0? 2.0 seems to fix the most basic test we can reproduce this in, but we have more complicated tests that show the same behavior. We are hoping to track it down; it seems as though the sync recv is reported as complete a little earlier than it actually is… when in fact it hasn't completed.

Kindest Regards,
—
Steven Eliuk






--
- Akshay