[mvapich-discuss] mvapich bug maybe when using array slicing, reproducer attached

Ben Benjamin.M.Auer at nasa.gov
Tue Dec 31 12:51:12 EST 2013


I sent a message earlier about what seemed like an mvapich bug (see below).
On further testing it seems to be some sort of intel 13/mvapich interaction:

I tried my tester with a few other compiler/mvapich combinations that I 
have available:

mvapich 1.8.1 and intel 13.1.3.192 failed
mvapich 1.9 and intel 13.1.3.192 failed

mvapich 1.8.1 and pgi 13.5  works
mvapich 1.9 and gcc 4.8.1 works
openmpi 1.7.3 and intel 13.1.3.192 works



On 12/30/2013 02:11 PM, Ben wrote:
> In diagnosing a problem we were having with some new code we came 
> across some strange behaviour with mvapich 2.0a2 and intel 13.1.3.192
>
> Basically we have some worker processes in our job that buffer 
> 3D variables which, once fully received, are written out.
> The worker processes receive the data one 2D slice at a time in a loop, 
> since that is how the data is processed on the sending end.
> So we have a loop that looks something like this on the receiver side:
>
> real, allocatable :: buffer(:,:,:)
>
> allocate buffer
>
> do i=1,nslices
>     call MPI_RECV(buffer(:,:,i), datasize, MPI_REAL, sender_rank, tag, &
>                   MPI_COMM_WORLD, mpistatus, ierr)
> enddo
>
> Above a certain size of the first two dimensions of the buffer our code 
> was failing, and we traced it to the receive. Even though MPI did not 
> return an error and the status reported that the right amount of data 
> had been received, the buffer was never written to. I initialized it to 
> a non-zero value, and the buffer sometimes never gets touched at all in 
> the MPI_RECV call.
>
> When I instead did the receive into a 2D buffer and then copied that 
> into the 3D buffer, the code worked:
>
> real, allocatable :: buffer(:,:,:)
> real, allocatable :: buffer2d(:,:)
>
> allocate buffer and buffer2d
>
> do i=1,nslices
>     call MPI_RECV(buffer2d, datasize, MPI_REAL, sender_rank, tag, &
>                   MPI_COMM_WORLD, mpistatus, ierr)
>     buffer(:,:,i) = buffer2d
> enddo
>
>
> I've made a little tester, attached, that reproduces this problem. 
> Basically the root process just keeps sending data to a worker process 
> in a loop. When I run it and _make sure the worker process receiving 
> the data is on a different physical node than the root process sending 
> the data_, the receive starts failing after a couple of iterations of 
> the loop with the sizes I have hard-coded in now. This works with 
> openmpi, so I'm wondering: have we uncovered an mvapich bug, or is 
> putting buffer(:,:,i) in the MPI_RECV call just not safe?


-- 
Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
NASA GSFC,  Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
Phone: 301-286-9176               Fax: 301-614-6246
