[mvapich-discuss] MPI_Alltoall crashes/stalls

Hari Subramoni subramoni.1 at osu.edu
Mon Feb 9 12:46:36 EST 2015


Hi Florian,

Thanks for the report. We are taking a look at it. In the meantime, could
you try running with MV2_USE_SLOT_SHMEM_COLL=0? If that does not help,
could you please try MV2_USE_SHMEM_COLL=0?
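
For example, with the mpirun_rsh launcher the parameter can be passed on
the command line right before the executable. The process count, hostfile,
and binary name below are placeholders, so please adjust them to your
setup:

    mpirun_rsh -np 6000 -hostfile hosts MV2_USE_SLOT_SHMEM_COLL=0 ./alltoall_test
    mpirun_rsh -np 6000 -hostfile hosts MV2_USE_SHMEM_COLL=0 ./alltoall_test

If you launch through a different mechanism, exporting the variable in the
job environment before the run should have the same effect.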

Regards,
Hari.

On Mon, Feb 9, 2015 at 12:12 PM, Florian Mannuß <mannuss at gmx.com> wrote:

> I am running ParMETIS on our cluster and noticed that it hangs or crashes
> when using more than ~6000 cores. A debug run showed that MPI_Alltoall is
> the problem, and a small test application reproduces the error. However,
> when running the test application on TACC with 6000 cores, no problems
> appear (MVAPICH2 2.0b & Intel 14). I searched through the MVAPICH2 source
> code and found the “MV2_USE_OLD_ALLTOALL” environment variable. Using it
> solved the problem, but then our simulator hangs in an MPI_Bcast call, and
> “MV2_USE_OLD_BCAST” does not fix that. When debugging, both calls
> (MPI_Alltoall, MPI_Bcast) seem to hang in a barrier-like code segment. We
> use MVAPICH2 2.0.1 and the Intel 15 compiler. I also tried the newest
> MVAPICH2 2.1rc1, but the problem still occurs. Are there any compile-time
> or runtime options for MVAPICH2 that solve this kind of problem?
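>
> For reference, this is roughly how I enabled it (the process count,
> hostfile, and binary name are specific to our setup):
>
>     mpirun_rsh -np 6000 -hostfile hosts MV2_USE_OLD_ALLTOALL=1 ./our_simulator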
>
> Here is the code I used for testing:
> #include <mpi.h>
> #include <cstring>
>
> int main (int argc, char **argv)
> {
>     // Init MPI, get communicator size and this rank's id
>     MPI_Init(&argc, &argv);
>     int li_num_nodes, li_myid;
>     MPI_Comm_size(MPI_COMM_WORLD, &li_num_nodes);
>     MPI_Comm_rank(MPI_COMM_WORLD, &li_myid);
>
>     // Duplicate MPI_COMM_WORLD, as the real application does
>     MPI_Comm duplicated_comm;
>     MPI_Comm_dup(MPI_COMM_WORLD, &duplicated_comm);
>
>     // Two ints per destination rank, filled with this rank's id
>     int *send_buffer = new int[li_num_nodes*2];
>     for (int i = 0; i < li_num_nodes*2; ++i)
>         send_buffer[i] = li_myid;
>     int *recv_buffer = new int[li_num_nodes*2];
>     memset(recv_buffer, 0, sizeof(int) * li_num_nodes * 2);
>
>     // The call that hangs or crashes beyond ~6000 cores
>     MPI_Alltoall(send_buffer, 2, MPI_INT, recv_buffer, 2, MPI_INT,
>                  duplicated_comm);
>
>     delete[] send_buffer;
>     delete[] recv_buffer;
>     MPI_Comm_free(&duplicated_comm);
>     MPI_Finalize();
>     return 0;
> }
>
>
> Thanks,
> Florian