[mvapich-discuss] Segfault and corruption with RPUT/RGET

Dovis Alessandro adovis at student.ethz.ch
Mon Mar 23 10:47:32 EDT 2015


Hello everyone,

We are using MVAPICH2 2.1rc2 on a cluster with InfiniBand (cards: Mellanox ConnectX-3 Single Port [with VPI]; switches: Mellanox SX6018).

A simplified version of the part of our system that exhibits the issue is available at the following link:
https://github.com/dovix91/mvapich2-issue

Running the binary with a command like the following causes a segmentation fault after some iterations of the sending loop:
/opt/mvapich2-2.1rc2/bin/mpiexec --host machine1,machine2 -n 2 ./bug_test 0 5 2
Furthermore, the results are corrupted on the receiving side. This can be seen from the message footer (which we use at the application level), set at lines 82-83 of bug_test.cpp and printed in net/mpi/NetworkManagerMPIBlockingMultithread.cpp: the receiver repeatedly prints "MPI recv: batch number 0, sequence number 0", and 0 is not a valid batch number (the sender correctly prints only positive batch numbers).
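
To make the footer check described above concrete, here is a minimal sketch of how such an application-level footer could be written and validated. The struct layout and the names MessageFooter, write_footer, read_footer and footer_looks_valid are hypothetical and chosen for illustration only; the actual definitions are in the linked repository and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical footer layout (an assumption, not the repository's definition).
struct MessageFooter {
    std::uint32_t batch_number;     // the sender only emits values > 0
    std::uint32_t sequence_number;
};

// Sender side: copy the footer into the last bytes of the message buffer.
inline void write_footer(std::vector<char>& buf, const MessageFooter& footer) {
    assert(buf.size() >= sizeof(MessageFooter));
    std::memcpy(buf.data() + buf.size() - sizeof(MessageFooter),
                &footer, sizeof(MessageFooter));
}

// Receiver side: read the footer back from the last bytes of the buffer.
inline MessageFooter read_footer(const std::vector<char>& buf) {
    assert(buf.size() >= sizeof(MessageFooter));
    MessageFooter footer;
    std::memcpy(&footer,
                buf.data() + buf.size() - sizeof(MessageFooter),
                sizeof(MessageFooter));
    return footer;
}

// Since the sender never uses batch number 0, a footer carrying it
// cannot have come from a correctly delivered message.
inline bool footer_looks_valid(const MessageFooter& footer) {
    return footer.batch_number > 0;
}
```

In this sketch a zero-filled receive buffer decodes to batch number 0 and sequence number 0, so the receiver could call footer_looks_valid(read_footer(buf)) after each receive and flag the message instead of processing it.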

Running the same code with the option "-env MV2_RNDV_PROTOCOL=R3", the issue doesn't disappear.

Thank you very much for the help.

Best,
Alessandro Dovis

