[mvapich-discuss] mvapich job startup unreliable with slurm and --cpu_bind (patch)

Dhabaleswar Panda panda at cse.ohio-state.edu
Sun Jul 30 23:38:21 EDT 2006


Hi Greg and Mike, 

Many thanks for sending us the patch related to Slurm and --cpu_bind
on July 26th.

You had sent this note to mvapich at cse. Since `mvapich at cse' is an
announcement-only list, your message was blocked and I only noticed it
now.

I am forwarding this note to mvapich-discuss at cse.ohio-state.edu.

As you might have noticed, we have just released mvapich 0.9.8.
We will review your patch and incorporate it into the trunk and the
0.9.8 branch soon.

May I request that you post your future patches to
mvapich-discuss at cse.ohio-state.edu.

Best Regards,

DK

----------------------------------------------------------------

The following patch seems to fix a problem starting mvapich jobs with
slurm and the --cpu_bind option.  Under these conditions, some of the
MPI processes never return from MPI_Init() and the job hangs at
launch.  We think this is because, with slurm and --cpu_bind, process
startup is more tightly synchronized.  The patch changes the address
exchange in comm_rdma_init.c from a rank-ordered loop over all peers
to a ring-ordered schedule.
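
To illustrate the pattern the patch introduces, here is a minimal,
self-contained sketch of the ring-ordered MPI_Sendrecv schedule.  The
names used here (exch_payload, exchange_ring) are illustrative only
and do not appear in the mvapich sources, which exchange struct
Coll_Addr_Exch over comm->lrank_to_grank[]:

/* Minimal sketch of a ring-ordered address exchange (names are
 * illustrative; the real code lives in src/context/comm_rdma_init.c
 * and exchanges struct Coll_Addr_Exch). */
#include <mpi.h>
#include <stdlib.h>

struct exch_payload {
    int rank;                       /* stand-in for the exchanged handle */
};

static void exchange_ring(MPI_Comm comm)
{
    int np, me;
    MPI_Comm_size(comm, &np);
    MPI_Comm_rank(comm, &me);

    struct exch_payload send_pkt = { me };
    struct exch_payload *recv_pkt = malloc(np * sizeof *recv_pkt);

    /* Step i: send to the rank (i+1) places to the right, receive from
     * the rank (i+1) places to the left.  All ranks run the same steps,
     * so each send is matched by a receive posted in the same step. */
    int right = (me + 1) % np;
    int left  = (me + np - 1) % np;
    for (int i = 0; i < np - 1; i++) {
        MPI_Sendrecv(&send_pkt, sizeof send_pkt, MPI_BYTE, right, 0,
                     &recv_pkt[left], sizeof recv_pkt[left], MPI_BYTE,
                     left, 0, comm, MPI_STATUS_IGNORE);
        right = (right + 1) % np;
        left  = (left + np - 1) % np;
    }

    free(recv_pkt);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    exchange_ring(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Because every rank follows the same schedule, at step i each rank
sends to the partner i+1 positions to its right and receives from the
partner i+1 positions to its left, so the matching send and receive
are posted in the same step on both sides.  This keeps the exchange
balanced even when all ranks start it at essentially the same time,
rather than depending on each rank's position in a rank-ordered loop.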

Thanks,

Greg Johnson & Mike Lang

diff -ur mvapich-0.9.8-rc0.orig/src/context/comm_rdma_init.c mvapich-0.9.8-rc0/src/context/comm_rdma_init.c
--- mvapich-0.9.8-rc0.orig/src/context/comm_rdma_init.c 2006-07-11 16:49:44.000000000 -0600
+++ mvapich-0.9.8-rc0/src/context/comm_rdma_init.c      2006-07-11 15:35:46.000000000 -0600
@@ -162,6 +162,7 @@
 {
 #ifndef CH_GEN2_MRAIL
     int i = 0;
+    int right, left;
     struct Coll_Addr_Exch send_pkt;
     struct Coll_Addr_Exch *recv_pkt;

@@ -188,19 +189,17 @@
 #else
     send_pkt.buf_hndl = comm->collbuf->l_coll->buf_hndl;
 #endif
-
-    for(i = 0; i < comm->np; i++) {
-        /* Don't send to myself */
-        if(i == comm->local_rank) continue;
-
+    right=(comm->local_rank + 1)%comm->np;
+    left=(comm->local_rank + comm->np - 1)%comm->np;
+    for(i=0; i < comm->np-1; i++) {
         MPI_Sendrecv((void*)&send_pkt, sizeof(struct Coll_Addr_Exch),
-                MPI_BYTE, comm->lrank_to_grank[i], ADDR_EXCHANGE_TAG,
-                (void*)&(recv_pkt[i]),sizeof(struct Coll_Addr_Exch),
-                MPI_BYTE, comm->lrank_to_grank[i], ADDR_EXCHANGE_TAG,
+                MPI_BYTE, comm->lrank_to_grank[right], ADDR_EXCHANGE_TAG,
+                (void*)&(recv_pkt[left]),sizeof(struct Coll_Addr_Exch),
+                MPI_BYTE, comm->lrank_to_grank[left], ADDR_EXCHANGE_TAG,
                 MPI_COMM_WORLD, &(statarray[i]));
-        if (statarray[i].MPI_ERROR != MPI_SUCCESS) {
-                fprintf(stderr, "blah! %d %d\n", comm->local_rank, statarray[i].MPI_ERROR);
-        }
+
+       right = (right+1)%comm->np;
+       left = (left + comm->np - 1)%comm->np;
     }

     for(i = 0; i < comm->np; i++) {

