[mvapich-discuss] mvapich bug maybe when using array slicing, reproducer attached

Panda, Dhabaleswar panda at cse.ohio-state.edu
Tue Dec 31 17:02:41 EST 2013


Hi Ben,

Thanks for providing the additional details on the issue you are seeing. Could you please let us
know the configuration flags you are using? We will take a look at it.

Thanks,

DK

________________________________
From: mvapich-discuss [mvapich-discuss-bounces at cse.ohio-state.edu] on behalf of Ben [Benjamin.M.Auer at nasa.gov]
Sent: Tuesday, December 31, 2013 12:51 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] mvapich bug maybe when using array slicing, reproducer attached

I sent a message earlier about what seemed like an mvapich bug (see below).
On further testing it seems to be some sort of intel 13/mvapich interaction:

I tried my tester with a few other compiler/mvapich combinations that I have available

mvapich 1.8.1 and intel 13.1.3.192 failed
mvapich 1.9 and intel 13.1.3.192 failed

mvapich 1.8.1 and pgi 13.5  works
mvapich 1.9 and gcc 4.8.1 works
openmpi 1.7.3 and intel 13.1.3.192 works



On 12/30/2013 02:11 PM, Ben wrote:
In diagnosing a problem we were having with some new code, we came across some strange behaviour with mvapich 2.0a2 and intel 13.1.3.192.

Basically we have some worker processes in our job that buffer 3D variables which, once fully received, are written out.
The worker processes receive the data one 2D slice at a time in a loop, as that is how the data is processed on the sending end.
So we have a loop that looks something like this on the receiver side:

real, allocatable :: buffer(:,:,:)

allocate buffer

do i=1,nslices
    call MPI_RECV(buffer(:,:,i),datasize,MPI_REAL,sender_rank,tag,MPI_COMM_WORLD,mpistatus,ierr)
enddo

Above a certain size of the first 2 dimensions of the buffer our code was failing, and we traced it to the receive. Somehow, despite MPI not returning an error and reporting that it received the right amount of data, the buffer was never written to. I initialized it to a non-zero value and the buffer variable sometimes never gets touched in the MPI_RECV call.
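
The check was roughly of this form (just a sketch; SENTINEL and nrecv are illustrative names, not from the actual code):

buffer = SENTINEL     ! fill with a value the sender never transmits
do i=1,nslices
    call MPI_RECV(buffer(:,:,i),datasize,MPI_REAL,sender_rank,tag,MPI_COMM_WORLD,mpistatus,ierr)
    call MPI_GET_COUNT(mpistatus,MPI_REAL,nrecv,ierr)   ! reports the full datasize was received
    if (any(buffer(:,:,i) == SENTINEL)) print *, 'slice', i, 'was never written'
enddo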

When I instead did the receive into a 2D buffer and then copied that to the 3D buffer, the code worked:

real, allocatable :: buffer(:,:,:)
real, allocatable :: buffer2d(:,:)

allocate buffer and buffer2d

do i=1,nslices
    call MPI_RECV(buffer2d,datasize,MPI_REAL,sender_rank,tag,MPI_COMM_WORLD,mpistatus,ierr)
    buffer(:,:,i) = buffer2d
enddo
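
For what it's worth, since the slice is over the last dimension it is contiguous in memory, so another workaround
that is sometimes suggested (just a sketch, not from our actual code) is to pass the first element of the slice,
so no compiler-generated temporary is involved at all:

do i=1,nslices
    call MPI_RECV(buffer(1,1,i),datasize,MPI_REAL,sender_rank,tag,MPI_COMM_WORLD,mpistatus,ierr)
enddo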


I've made a little tester, attached, that reproduces this problem. Basically the root process just keeps sending data to a worker process in a loop. When I run this and make sure the worker process receiving the data is on a different physical node than the root process sending it, the receive starts failing after a couple of iterations of the loop with the sizes I have hard-coded in now. This worked with openmpi, so I'm wondering whether we have uncovered an mvapich bug or whether putting buffer(:,:,i) in the MPI_RECV call is just not safe?
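
In case the attachment does not come through, here is a rough self-contained sketch of what the tester does
(the dimensions and slice count below are placeholders, not the values hard-coded in the attached program):

program slice_recv_test
    use mpi
    implicit none
    integer, parameter :: nx = 1440, ny = 720, nslices = 72   ! placeholder sizes
    real, allocatable  :: buffer(:,:,:), slab(:,:)
    integer :: rank, ierr, i, mpistatus(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

    if (rank == 0) then
        ! root keeps sending one 2D slice at a time to the worker (rank 1)
        allocate(slab(nx,ny))
        do i=1,nslices
            slab = real(i)
            call MPI_SEND(slab, nx*ny, MPI_REAL, 1, i, MPI_COMM_WORLD, ierr)
        enddo
    else if (rank == 1) then
        ! worker receives each slice directly into a section of the 3D buffer
        allocate(buffer(nx,ny,nslices))
        buffer = -1.0     ! sentinel value the sender never sends
        do i=1,nslices
            call MPI_RECV(buffer(:,:,i), nx*ny, MPI_REAL, 0, i, MPI_COMM_WORLD, mpistatus, ierr)
            if (any(buffer(:,:,i) /= real(i))) print *, 'slice', i, 'not received correctly'
        enddo
    end if

    call MPI_FINALIZE(ierr)
end program slice_recv_test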

--
Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
NASA GSFC,  Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
Phone: 301-286-9176               Fax: 301-614-6246




_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss




--
Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
NASA GSFC,  Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
Phone: 301-286-9176               Fax: 301-614-6246


