[mvapich-discuss] Segfault and corruption with RPUT/RGET

Jian Lin lin.2180 at osu.edu
Wed Mar 25 22:25:19 EDT 2015


Hi, Alessandro,

Thanks for your notes!

We tested your reproducer, but we could not reproduce the exact issue
you described.

In your reproducer, the send buffers are allocated with malloc
continuously and freed only in the callback, so the program uses a
large amount of memory. In an environment with limited memory, we
noticed that the malloc at bug_test.cpp:81 may return NULL and cause a
subsequent segmentation fault. However, we did not see the other errors
you mentioned. Also, "-env MV2_RNDV_PROTOCOL=R3" does not seem to help
with this out-of-memory issue.
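
As a rough illustration of the failure mode above (this is not the code
in bug_test.cpp), the sketch below checks the result of malloc before
the buffer is used and caps how many buffers may be outstanding at
once. The names InFlightTracker, alloc_send_buffer, on_send_complete
and kMaxInFlight are hypothetical, and the cap of 4 is an arbitrary
value chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical cap on buffers allocated but not yet freed by the callback.
constexpr std::size_t kMaxInFlight = 4;

struct InFlightTracker {
    std::size_t outstanding = 0;

    // Returns nullptr if the cap is reached or if malloc fails.
    // The caller must check the result before writing to the buffer.
    void* alloc_send_buffer(std::size_t bytes) {
        if (outstanding >= kMaxInFlight) {
            return nullptr;  // too many sends pending; wait for a completion
        }
        void* buf = std::malloc(bytes);
        if (buf == nullptr) {
            return nullptr;  // out of memory; do not dereference
        }
        ++outstanding;
        return buf;
    }

    // Stand-in for the send-completion callback that frees the buffer.
    void on_send_complete(void* buf) {
        assert(outstanding > 0);
        std::free(buf);
        --outstanding;
    }
};
```

With a check like this, the sender can wait for a completion when
alloc_send_buffer returns nullptr instead of writing the footer into a
NULL pointer.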

Could you please check the memory usage of your application? If you
think this issue is related to MVAPICH2, could you please provide more
detailed information about the failure, or a simpler reproducer? Thank
you!
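
One possible way to check memory usage from inside the application is
the POSIX getrusage call, sketched below. The helper names peak_rss and
report_peak_rss are hypothetical. The sketch assumes Linux, where
ru_maxrss is reported in kilobytes; other systems may use different
units.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/resource.h>

// Returns the peak resident set size of this process, or -1 on failure.
// Assumption: Linux, where ru_maxrss is in kilobytes.
long peak_rss() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    return usage.ru_maxrss;
}

// Prints the current peak next to a label, e.g. once per send iteration.
void report_peak_rss(const char* where) {
    assert(where != nullptr);
    std::printf("%s: peak RSS = %ld\n", where, peak_rss());
}
```

Calling report_peak_rss inside the sending loop would show whether the
peak keeps growing until the allocation fails.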

On Mon, 23 Mar 2015 14:53:13 +0000
"Dovis  Alessandro" <adovis at student.ethz.ch> wrote:

> Errata corrige: Running the same code with the option "-env
> MV2_RNDV_PROTOCOL=R3", the issue *disappears*.
> ________________________________________
> From: Dovis  Alessandro
> Sent: Monday, March 23, 2015 3:47 PM
> To: mvapich-discuss at cse.ohio-state.edu
> Subject: Segfault and corruption with RPUT/RGET
> 
> Hello everyone,
> 
> we are using MVAPICH2 2.1rc2 on a cluster with Infiniband (cards:
> Mellanox ConnectX-3 Single Port [with VPI]; switches: Mellanox
> SX6018).
> 
> You can find - at the following link - a simplified version of the
> part of our system that experiences the issue:
> https://github.com/dovix91/mvapich2-issue
> 
> Running the binary with a command like the following causes a
> segmentation fault after some iterations of the sending loop:
>
>   /opt/mvapich2-2.1rc2/bin/mpiexec --host machine1,machine2 -n 2 ./bug_test 0 5 2
>
> Furthermore, the results are corrupted at the
> receiving side: this can be seen by the message footer (that we use
> at the application level), set at lines 82-83 of bug_test.cpp and
> printed in net/mpi/NetworkManagerMPIBlockingMultithread.cpp; the
> receiver prints many times "MPI recv: batch number 0, sequence number
> 0", which is not a valid batch number (the sender prints only
> positive batch numbers, correctly).
> 
> Running the same code with the option "-env MV2_RNDV_PROTOCOL=R3" the
> issue doesn't disappear.
> 
> Thank you very much for the help.
> 
> Best,
> Alessandro Dovis
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss



-- 
Jian Lin
http://linjian.org


