[mvapich-discuss] Strange error with MPI_REDUCE

amith rajith mamidala mamidala at cse.ohio-state.edu
Sun Dec 9 16:03:56 EST 2007


Hi Christian,

Can you also try the patch attached to this mail and let us know how it
works?

Thanks,
Amith.


On Sat, 8 Dec 2007, Dhabaleswar Panda wrote:

> Thanks for reporting this issue. Can you tell us which version of 0.9.9
> you are using (the one available with OFED 1.2 or the one from the OSU
> site)? Which compiler are you using? Can you also check whether you see
> the same problem with the latest MVAPICH 1.0-beta (please use the latest
> version from the trunk)?
>
> In the meantime, we will also investigate this issue further.
>
> Thanks,
>
> DK
>
>
> On Fri, 7 Dec 2007, Christian Boehme wrote:
>
> > Dear list,
> >
> > we recently encountered a strange problem with MPI_REDUCE in our
> > mvapich-0.9.9 installation. Please consider the following F77 program:
> >
> >        program reduce_err
> >
> >        implicit none
> > c FORTRAN MPI-INCLUDE-file
> >        include 'mpif.h'
> >        integer ierr, nproc, myid
> >        real*8  x , y
> >
> >        call MPI_INIT( ierr )
> >        call MPI_COMM_SIZE( MPI_COMM_WORLD, nproc, ierr )
> >        call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
> >        x = 0
> >        y = 1
> >        call MPI_REDUCE( y, x, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 1,
> >       :                 MPI_COMM_WORLD, ierr )
> >        write(6,*) myid, ': Value for x after reduce:', x
> >        call MPI_FINALIZE( ierr )
> >
> >        stop
> >        end
> >
> > Obviously, the output should be the number of processes for myid=1, and
> > zero for all other processes. This is also what we get when using either
> > one process per node (only InfiniBand communication) or all processes
> > on one node (only shared memory):
> >
> > > mpirun_rsh -np 4 gwdm001 gwdm004 gwdm002 gwdm003 reduce_err
> > >            3 : Value for x after reduce:   0.00000000000000
> > >            2 : Value for x after reduce:   0.00000000000000
> > >            1 : Value for x after reduce:   4.00000000000000
> > >            0 : Value for x after reduce:   0.00000000000000
> >
> > However, when mixing the two, i.e., using several nodes and placing more
> > than one process on one of them, we also get the number of processes for
> > myid=0:
> >
> > > mpirun_rsh -np 4 gwdm001 gwdm001 gwdm002 gwdm003 reduce_err
> > >            1 : Value for x after reduce:   4.00000000000000
> > >            2 : Value for x after reduce:   0.00000000000000
> > >            3 : Value for x after reduce:   0.00000000000000
> > >            0 : Value for x after reduce:   4.00000000000000
> >
> > This behavior is rather unexpected and can seriously break some
> > programs. What could be the problem? Many thanks in advance
> >
> > Christian Boehme
> >
-------------- next part --------------
Index: intra_fns_new.c
===================================================================
--- intra_fns_new.c	(revision 1650)
+++ intra_fns_new.c	(working copy)
@@ -5074,7 +5074,7 @@
     MPI_Comm shmem_comm, leader_comm;
     struct MPIR_COMMUNICATOR *comm_ptr = 0,*shmem_commptr = 0, *leader_commptr = 0;
     int local_rank = -1, global_rank = -1, local_size=0, my_rank;
-    void* local_buf=NULL, *tmpbuf=NULL;
+    void* local_buf=NULL, *tmpbuf=NULL, *tmpbuf1=NULL;
     int stride = 0, i, is_commutative;
     int leader_root, total_size=0, shmem_comm_rank;
 
@@ -5156,6 +5156,11 @@
                     MPIR_REDUCE_TAG, comm_ptr->self, &status);
         }
 
+        if (local_rank == 0){
+            MPIR_ALLOC(tmpbuf1, MALLOC(count*extent), comm_ptr, MPI_ERR_EXHAUSTED, myname);
+            tmpbuf1 = (void *)((char*)tmpbuf1 - lb);
+        }
+
         if (local_size > 1){
             MPID_SHMEM_COLL_GetShmemBuf(local_size, local_rank, shmem_comm_rank, &shmem_buf);
         }
@@ -5176,11 +5181,11 @@
             leader_root = comm_ptr->leader_rank[leader_of_root];
             if (local_size != total_size){
                 if (local_size > 1){
-                    mpi_errno = intra_Reduce(tmpbuf, recvbuf, count, datatype,
+                    mpi_errno = intra_Reduce(tmpbuf, tmpbuf1, count, datatype,
                             op, leader_root, leader_commptr);
                 }
                 else{
-                    mpi_errno = intra_Reduce(sendbuf, recvbuf, count, datatype,
+                    mpi_errno = intra_Reduce(sendbuf, tmpbuf1, count, datatype,
                             op, leader_root, leader_commptr);
                 }
             }
@@ -5207,19 +5212,27 @@
             MPID_SHMEM_COLL_SetGatherComplete(local_size, local_rank, shmem_comm_rank);
         }
 
+        if ((local_rank == 0) && (root == my_rank)){
+            mpi_errno = MPI_Sendrecv(tmpbuf1, count, datatype->self, rank,
+                    MPIR_REDUCE_TAG, recvbuf, count, datatype->self, rank, 
+                    MPIR_REDUCE_TAG, comm_ptr->self, &status);
+            return MPI_SUCCESS;
 
+        }
+
         /* Copying data from leader to the root incase
          * leader is not the root */
         if (local_size > 1){
             /* Send the message to the  root if the leader is not the
              * root of the reduce operation */
+
             if ((local_rank == 0) && (root != my_rank) && (leader_root == global_rank)){
                 if (local_size == total_size){
                     mpi_errno  = MPI_Send( tmpbuf, count, datatype->self, root,
                             MPIR_REDUCE_TAG, comm->self );
                 }
                 else{
-                    mpi_errno  = MPI_Send( recvbuf, count, datatype->self, root,
+                    mpi_errno  = MPI_Send( tmpbuf1, count, datatype->self, root,
                             MPIR_REDUCE_TAG, comm->self );
                 }
             }
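
The key change in the patch is that the node leader now stages the result of
the inter-leader reduction in a freshly allocated temporary buffer (tmpbuf1)
instead of writing it into its own recvbuf, and the final value is copied or
sent into recvbuf only on the process that is actually the root of the
reduce. That matches the symptom above: before the patch, rank 0 (the leader
of gwdm001, but not the root) ended up with the global sum in its x. Below is
a minimal, self-contained sketch of that staging idea in plain MPI C. It is
not the MVAPICH code path; the PROCS_PER_NODE grouping and the name
two_level_reduce are assumptions made only for illustration.

/* two_level_reduce_sketch.c -- a sketch only, not the MVAPICH internals.
 * Each "node leader" reduces into a private scratch value (the analogue
 * of tmpbuf1 in the patch) and only the real root ever writes recvbuf.
 * Node membership is faked with rank / PROCS_PER_NODE. */
#include <stdio.h>
#include <mpi.h>

#define PROCS_PER_NODE 2             /* assumption made for the sketch */

static int two_level_reduce(double sendval, double *recvbuf,
                            int root, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    MPI_Status status;
    int rank, node_rank, root_color, owning_leader;
    double node_sum = 0.0, staged = 0.0;

    MPI_Comm_rank(comm, &rank);

    /* Fake intra-node communicator: same color = "same node". */
    MPI_Comm_split(comm, rank / PROCS_PER_NODE, rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader (node_rank == 0) per "node". */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                   &leader_comm);

    root_color    = root / PROCS_PER_NODE;        /* root's "node"       */
    owning_leader = root_color * PROCS_PER_NODE;  /* that node's leader  */

    /* Step 1: intra-node reduce to the leader, into a scratch value,
     * never into the caller's recvbuf. */
    MPI_Reduce(&sendval, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Step 2: inter-leader reduce, again into a scratch value -- the
     * analogue of tmpbuf1 in the patch. */
    if (node_rank == 0)
        MPI_Reduce(&node_sum, &staged, 1, MPI_DOUBLE, MPI_SUM,
                   root_color, leader_comm);

    /* Step 3: only the real root receives the result into recvbuf; the
     * owning leader forwards it when it is not the root itself.  Every
     * other rank's recvbuf stays untouched, which is the property the
     * unpatched code violated for rank 0. */
    if (rank == owning_leader && rank == root)
        *recvbuf = staged;
    else if (rank == owning_leader)
        MPI_Send(&staged, 1, MPI_DOUBLE, root, 0, comm);
    else if (rank == root)
        MPI_Recv(recvbuf, 1, MPI_DOUBLE, owning_leader, 0, comm, &status);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    int rank;
    double x = 0.0, y = 1.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    two_level_reduce(y, &x, 1, MPI_COMM_WORLD);   /* root = 1, as above */
    printf("%d : Value for x after reduce: %f\n", rank, x);
    MPI_Finalize();
    return 0;
}

Run with four processes and root 1, the sketch mirrors the failing case
above: rank 0 acts as a node leader but is not the root, and its x stays 0.0
because leaders only ever touch their scratch value.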

