[mvapich-discuss] hang in MPI_Bcast

Frank Riley fhr at rincon.com
Thu Jun 14 11:52:33 EDT 2012


Hello,

Our app hangs in an MPI_Bcast when run under MVAPICH2 1.8, and I'm looking for suggestions on how to debug the problem. I'm running a 2-process, 2-node job. Stopping the hung processes in the debugger, I can see that MVAPICH2 is stuck busy-waiting in MPIDI_CH3I_Progress on both the write side and the read side of the MPI_Bcast. Note that by the point of the hang we have already made other MPI send/receive calls, as well as a few other MPI_Bcast calls, all of which completed successfully. Based on my debugging so far, I'm starting to think there is a bug in MVAPICH2 that depends on the timing of the calls. Our app starts off as Python, which eventually calls into C++ code that runs the main algorithm and does the MPI work. If I run the app without the Python wrapper, it does not hang.
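Since I can't post the real code, the communication pattern is roughly equivalent to the sketch below (buffer sizes, counts, and the number of broadcasts are placeholders, not our actual values; the Python wrapper is omitted entirely):

#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Point-to-point traffic that completes successfully before the hang.
    const int n = 1024;                  // placeholder count
    std::vector<double> p2p(n, 0.0);
    if (rank == 0) {
        MPI_Send(p2p.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(p2p.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    // A series of broadcasts from rank 0: the first few return, a later one
    // never does, and both ranks spin in MPIDI_CH3I_Progress.
    const int m = 4096;                  // placeholder count
    std::vector<double> bcast(m, 1.0);
    for (int i = 0; i < 8; ++i) {        // placeholder iteration count
        MPI_Bcast(bcast.data(), m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}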

Other interesting things to note:
1) Our app requests MPI_THREAD_SERIALIZED and we run with MV2_ENABLE_AFFINITY=0 (see the initialization sketch after this list)
2) an intra-node MPI_Bcast does not hang (2 processes, 1 node)
3) MPICH2 1.4.1p1 does not hang
4) building MVAPICH2 with the "mem" or "memarena" debug options makes the hang go away; all other debug options still result in a hang
5) on MVAPICH2 1.7 the hang also occurs, but at one of the earlier MPI_Bcast calls
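For reference, item 1 corresponds to an initialization along these lines (a stripped-down sketch, not our wrapper code); MV2_ENABLE_AFFINITY=0 is exported in the environment before launch, and printing the provided level is just a sanity check I added while debugging, not something our app normally does:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    // Request MPI_THREAD_SERIALIZED and report what the library actually granted.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        std::printf("requested level %d (MPI_THREAD_SERIALIZED), provided level %d\n",
                    MPI_THREAD_SERIALIZED, provided);
    }

    MPI_Finalize();
    return 0;
}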

Unfortunately, the app and cluster reside on a network that is not connected to the internet, so it is difficult to provide debug output or code. If anyone can provide suggestions on how to track this down, I would greatly appreciate it. We would really like to use MVAPICH2.

Thank you,
Frank


