[mvapich-discuss] Strange error with MPI_REDUCE

Dhabaleswar Panda panda at cse.ohio-state.edu
Sat Dec 8 11:31:48 EST 2007


Thanks for reporting this issue. Can you tell us which version of 0.9.9
you are using (the one available with OFED 1.2 or the one from the OSU
site)? Which compiler are you using? Can you also check whether you see
the same problem with the latest MVAPICH 1.0-beta (please use the latest
version from the trunk)?
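
For reference, one quick way to capture both pieces of information,
assuming the MPICH-derived compiler wrappers are on your PATH (-show is
the MPICH wrapper convention for printing the underlying compiler
command line; adjust if your installation differs):

    $ which mpif77     # confirms which MVAPICH installation is in use
    $ mpif77 -show     # prints the underlying Fortran compiler and flags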

In the meantime, we will also investigate this issue further.
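
As an additional cross-check on your side, it would be interesting to
know whether MPI_ALLREDUCE shows the same behavior. Below is an
untested sketch against your program: MPI_ALLREDUCE takes the same
arguments minus the root, and it defines the receive buffer on every
rank, so it sidesteps the question of what non-root ranks may see in x:

      call MPI_ALLREDUCE( y, x, 1, MPI_DOUBLE_PRECISION, MPI_SUM,
     :                    MPI_COMM_WORLD, ierr )

If this behaves correctly on the mixed node/shared-memory runs, the
problem is likely confined to the MPI_REDUCE code path.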

Thanks,

DK


On Fri, 7 Dec 2007, Christian Boehme wrote:

> Dear list,
>
> we recently encountered a strange problem with MPI_REDUCE in our
> mvapich-0.9.9 installation. Please consider the following F77 program:
>
>        program reduce_err
>
>        implicit none
> c FORTRAN MPI-INCLUDE-file
>        include 'mpif.h'
>        integer ierr, nproc, myid
>        real*8  x , y
>
>        call MPI_INIT( ierr )
>        call MPI_COMM_SIZE( MPI_COMM_WORLD, nproc, ierr )
>        call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
>        x = 0
>        y = 1
>        call MPI_REDUCE( y, x, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 1,
>       :                 MPI_COMM_WORLD, ierr )
>        write(6,*) myid, ': Value for x after reduce:', x
>        call MPI_FINALIZE( ierr )
>
>        stop
>        end
>
> Obviously, the output should be the number of processes for myid=1, and
> zero for all other processes. This is also what we get when either
> using one process per node (InfiniBand communication only) or putting
> all processes on one node (shared memory only):
>
> > mpirun_rsh -np 4 gwdm001 gwdm004 gwdm002 gwdm003 reduce_err
> >            3 : Value for x after reduce:   0.00000000000000
> >            2 : Value for x after reduce:   0.00000000000000
> >            1 : Value for x after reduce:   4.00000000000000
> >            0 : Value for x after reduce:   0.00000000000000
>
> However, when mixing the two, i.e., using several nodes with more than
> one process on at least one of them, we also get the number of
> processes for myid=0:
>
> > mpirun_rsh -np 4 gwdm001 gwdm001 gwdm002 gwdm003 reduce_err
> >            1 : Value for x after reduce:   4.00000000000000
> >            2 : Value for x after reduce:   0.00000000000000
> >            3 : Value for x after reduce:   0.00000000000000
> >            0 : Value for x after reduce:   4.00000000000000
>
> This behavior is rather unexpected and can seriously break some
> programs. What could be the problem? Many thanks in advance.
>
> Christian Boehme
>


