[mvapich-discuss] MPI_Alltoall crashes/stalls

Florian Mannuß mannuss at gmx.com
Mon Feb 9 12:12:21 EST 2015


I run ParMETIS on our cluster and noticed that it hangs or crashes when using more than ~6000 cores. A debug run showed that MPI_Alltoall is the problem, and a small test application reproduces the error. However, the same test application runs on TACC with 6000 cores without any problems (MVAPICH2 2.0b & Intel 14). I searched through the MVAPICH2 source code and found the "MV2_USE_OLD_ALLTOALL" environment variable. Setting it solved the Alltoall problem, but then our simulator hangs in an MPI_Bcast call, and "MV2_USE_OLD_BCAST" does not fix that. Under the debugger, both calls (MPI_Alltoall, MPI_Bcast) appear to hang in a barrier-like code segment.
We use MVAPICH2 2.0.1 with the Intel 15 compiler. I also tried the newest MVAPICH2 2.1rc1, but the problem still occurs. Are there any compile-time or runtime flags for MVAPICH2 that solve this kind of problem?
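
In case it matters, this is how we enable the workaround at launch time (mpirun_rsh syntax; the hostfile and executable names here are just placeholders):

mpirun_rsh -np 6000 -hostfile ./hosts MV2_USE_OLD_ALLTOALL=1 ./alltoall_test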
 
Here is the code I used for testing:
#include <mpi.h>
#include <cstring>
 
int main (int argc, char **argv)
{
    // Init MPI, get comm size and rank id
    MPI_Init(&argc, &argv);
    int li_num_nodes, li_myid;
    MPI_Comm_size(MPI_COMM_WORLD, &li_num_nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &li_myid);
 
    // ParMETIS works on a duplicated communicator, so do the same here
    MPI_Comm duplicated_comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &duplicated_comm);
 
    // Every rank sends two ints (its own rank id) to every other rank
    int *send_buffer = new int[li_num_nodes*2];
    for (int i = 0; i < li_num_nodes*2; ++i)
        send_buffer[i] = li_myid;
    int *recv_buffer = new int[li_num_nodes*2];
    memset(recv_buffer, 0, sizeof(int) * li_num_nodes * 2);
 
    // This call hangs/crashes with more than ~6000 cores
    MPI_Alltoall(send_buffer, 2, MPI_INT, recv_buffer, 2, MPI_INT, duplicated_comm);
 
    delete[] send_buffer;
    delete[] recv_buffer;
    MPI_Comm_free(&duplicated_comm);
    MPI_Finalize();
    return 0;
}
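 
Since our simulator also hangs in MPI_Bcast, a correspondingly minimal broadcast test would look like this (just a sketch along the lines of the test above; the buffer size and root rank are arbitrary, and I have not isolated the Bcast hang to this exact pattern):
 
#include <mpi.h>
 
int main (int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int li_myid;
    MPI_Comm_rank(MPI_COMM_WORLD, &li_myid);
 
    // Rank 0 broadcasts a small buffer to all other ranks
    int payload[16] = {0};
    if (li_myid == 0)
        for (int i = 0; i < 16; ++i)
            payload[i] = i;
    MPI_Bcast(payload, 16, MPI_INT, 0, MPI_COMM_WORLD);
 
    MPI_Finalize();
    return 0;
}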
 
 
Thanks,
Florian
