[mvapich-discuss] MPI_Reduce(MPI_SUM) order

Rutger Hofman rutger at cs.vu.nl
Mon Dec 7 03:55:02 EST 2015


I wrote a small stand-alone C++ program to try to narrow down the
issue. It performs a tight loop of MPI_Reduce(float[K], ... MPI_SUM,
...) with K=64, invoked every iteration with identical, well-formed
float inputs. It compares the result of each later iteration with the
first result. Since MPI_Reduce is deterministic in its spanning tree,
these should be bit-identical.
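
For reference, the core of the test looks roughly like this (a
simplified sketch, not the exact program; the input values here are
just placeholders):

    #include <mpi.h>
    #include <cstring>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int K = 64;
        float in[K], out[K], first[K];
        for (int i = 0; i < K; i++)
            in[i] = 1.0f / (rank + i + 1);  // well-formed floats, identical every iteration

        for (int iter = 0; iter < 1000; iter++) {
            MPI_Reduce(in, out, K, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0) {
                if (iter == 0)
                    std::memcpy(first, out, sizeof first);   // keep iteration 0 as reference
                else if (std::memcmp(first, out, sizeof first) != 0)
                    std::printf("iter %d: result differs bitwise from iter 0\n", iter);
            }
        }

        MPI_Finalize();
        return 0;
    }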

The program ran on 5 machines, one single-threaded process per
machine. After 16 iterations, the result differed from the first
result. As in the issue I reported below, the difference appears
around the 7th significant digit in quite a number of array fields --
that might almost pass as 'floating-point correct', but I suspect it
is an artifact rather than a feature.
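
For comparison, plain re-association of a float sum produces
differences of exactly that size, since a float carries only about 7
significant decimal digits. A tiny stand-alone illustration (unrelated
to MVapich2 itself):

    #include <cstdio>

    int main() {
        // Sum the same five floats in two different orders.
        // Re-association typically changes the last bit or two of the
        // result, i.e. near the limit of float's ~7 significant digits.
        float v[5] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};

        float fwd = 0.0f;
        for (int i = 0; i < 5; i++) fwd += v[i];   // left-to-right
        float rev = 0.0f;
        for (int i = 4; i >= 0; i--) rev += v[i];  // right-to-left

        std::printf("forward = %.9g\nreverse = %.9g\n", fwd, rev);
        return 0;
    }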

Conclusion: MPI_Reduce is /not/ deterministic, even within one run. 
Since you explain that it should be deterministic, my guess is that some 
internal MVapich2 state gets corrupted (and I see no particular reason 
to suspect the spanning tree itself).

Should I post my code for ease of debugging? Are there other things that 
I can do?

Rutger

On 12/04/2015 11:58 PM, Hari Subramoni wrote:
> Hello,
>
> If the configuration chosen to run the MPI job stays the same, then
> MVAPICH2 preserves the order of operations, so there is no
> non-determinism. This makes reduction operations bitwise reproducible.
>
> However, the same guarantees can't be made if the job is first run as 2
> nodes with 4 processes per node and then as 4 nodes with 2 processes per
> node. Further, no guarantees can be made across different sets of
> machines, because floating-point results are inherently sensitive to
> the order of operations at the limit of their precision.
>
> Hope this helps.
>
> Regards,
> Hari.
>
> On Fri, Dec 4, 2015 at 4:11 AM, Rutger Hofman <rutger at cs.vu.nl> wrote:
>
>     Good morning,
>
>     my application uses MVapich2 (locally labeled mvapich2/gcc/64/2.0b)
>     over Infiniband in a CentOS cluster. I notice the following. When I
>     repeatedly run the application, the result of an MPI_Reduce(...,
>     MPI_FLOAT, ..., MPI_SUM) over an array of floats may be different
>     over various runs, although the inputs are exactly the same (I
>     checked the bit patterns of the floats), the number of machines is
>     the same, etc etc. The actual machines allocated, and the connection
>     to the switches, may be different over runs -- I didn't try to fix
>     the machine allocation within the cluster. The difference in the
>     reduce results is at most small, in the order of magnitude that one
>     would expect if the summation is carried out in a different order.
>
>     My question: is it possible with MVapich2 that the internal order
>     of the reduce operations differs, even when the number of machines
>     is the same? Is it easy/possible to enforce a fixed order in the
>     reduce implementation, just to verify this? Or should I suspect a
>     bug of my own, like some weird memory corruption? My application
>     also uses RDMA verbs natively; in principle that should work fine.
>
>     Thank you for your advice,
>
>     Rutger Hofman
>     VU Amsterdam DAS5 http://www.cs.vu.nl/das5
>     _______________________________________________
>     mvapich-discuss mailing list
>     mvapich-discuss at cse.ohio-state.edu
>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>


