[mvapich-discuss] MPI_Reduce(MPI_SUM) order

Rutger Hofman rutger at cs.vu.nl
Mon Dec 7 04:30:22 EST 2015


Update: when run on 3 machines, iteration 2340 (counting from 0) gives 
a different result. On 4 machines, iteration 16 gives a different 
result, the same as with 5 machines. On 2 machines, I ran 1,000,000 
iterations without error.

Rutger Hofman
VU Amsterdam
http://www.cs.vu.nl/das5

On 12/07/2015 09:55 AM, Rutger Hofman wrote:
> I wrote a small stand-alone C++ program to try to narrow down the
> issue. It runs a tight loop of MPI_Reduce(float[K], ..., MPI_SUM, ...)
> with K=64, invoked every iteration with identical, well-formed float
> inputs. It compares the result of each later iteration with the first
> result. Since MPI_Reduce reduces deterministically over its spanning
> tree, the results should be bit-identical.
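>
> A minimal sketch of such a loop (the constants and input values here
> are placeholders, not the actual program):
>
> #include <mpi.h>
> #include <cstring>
> #include <cstdio>
>
> int main(int argc, char **argv) {
>     MPI_Init(&argc, &argv);
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     const int K = 64;                // array length
>     const int ITERATIONS = 100000;   // placeholder iteration count
>     float in[K], out[K], first[K];
>     for (int i = 0; i < K; i++) {
>         // identical, well-formed inputs on every iteration
>         in[i] = 1.0f + 0.001f * rank + 0.01f * i;
>     }
>
>     for (int iter = 0; iter < ITERATIONS; iter++) {
>         MPI_Reduce(in, out, K, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
>         if (rank == 0) {
>             if (iter == 0) {
>                 memcpy(first, out, sizeof(first));
>             } else if (memcmp(first, out, sizeof(first)) != 0) {
>                 printf("iteration %d differs from iteration 0\n", iter);
>             }
>         }
>     }
>
>     MPI_Finalize();
>     return 0;
> }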
>
> The program ran as an application on 5 machines, one thread/process per
> machine. After 16 iterations, the result differs from the first result.
> As with the issue reported below, the difference is around the 7th
> significant digit in quite a number of array elements -- that might
> still count as 'floating-point correct', but I suspect it is an
> artifact rather than a feature.
>
> Conclusion: MPI_Reduce is /not/ deterministic, even within one run.
> Since you explain that it should be deterministic, my guess is that
> some internal MVAPICH2 state gets corrupted (and I see no particular
> reason to suspect the spanning tree itself).
>
> Should I post my code for ease of debugging? Are there other things that
> I can do?
>
> Rutger
>
> On 12/04/2015 11:58 PM, Hari Subramoni wrote:
>> Hello,
>>
>> If the MPI job is run repeatedly with the same configuration, MVAPICH2
>> retains the order of operations, so there is no non-determinism. This
>> makes reduction operations bitwise reproducible.
>>
>> However, the same guarantee cannot be made if the job is first run on 2
>> nodes with 4 processes per node and then on 4 nodes with 2 processes per
>> node. Likewise, no guarantees can be made across different sets of
>> machines: floating-point addition is not associative, so a different
>> reduction order can change the low-order bits of the result.
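>>
>> As a small illustration (a self-contained sketch, not MVAPICH2 code) of
>> how the same inputs summed in a different order give a different float
>> result:
>>
>> #include <cstdio>
>>
>> int main() {
>>     float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
>>     float left  = (a + b) + c;   // a and b cancel exactly, then + c gives 1
>>     float right = a + (b + c);   // c is below the ulp of b, so b + c == b and the sum is 0
>>     printf("left = %g, right = %g\n", left, right);
>>     return 0;
>> }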
>>
>> Hope this helps.
>>
>> Regards,
>> Hari.
>>
>> On Fri, Dec 4, 2015 at 4:11 AM, Rutger Hofman <rutger at cs.vu.nl> wrote:
>>
>>     Good morning,
>>
>>     My application uses MVAPICH2 (locally labeled mvapich2/gcc/64/2.0b)
>>     over InfiniBand in a CentOS cluster. I notice the following: when I
>>     repeatedly run the application, the result of an MPI_Reduce(...,
>>     MPI_FLOAT, ..., MPI_SUM) over an array of floats can differ between
>>     runs, even though the inputs are exactly the same (I checked the bit
>>     patterns of the floats), the number of machines is the same, and so
>>     on. The actual machines allocated, and their connections to the
>>     switches, may differ between runs -- I did not try to pin the machine
>>     allocation within the cluster. The differences in the reduce results
>>     are small, of the order of magnitude one would expect if the
>>     summation were carried out in a different order.
>>
>>     My question: is it possible with MVAPICH2 that the internal order of
>>     the reduce operations differs, even when the number of machines is
>>     the same? Is there an easy way to enforce a fixed order in the reduce
>>     implementation, just to verify this? Or should I suspect a bug of my
>>     own, such as some weird memory corruption? My application also uses
>>     RDMA verbs natively; in principle that should work fine.
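>>
>>     (For reference, a fixed order can be forced by hand by gathering the
>>     arrays on the root and summing them in rank order -- a sketch with a
>>     hypothetical helper fixed_order_sum, not application or MVAPICH2
>>     code, assuming the gathered arrays fit in memory on the root:
>>
>>     #include <mpi.h>
>>     #include <vector>
>>
>>     // Sums 'count' floats from every rank on 'root', always in rank
>>     // order 0, 1, 2, ..., so the result is reproducible for identical
>>     // inputs and an identical number of processes.
>>     void fixed_order_sum(const float *in, float *out, int count,
>>                          int root, MPI_Comm comm) {
>>         int rank, size;
>>         MPI_Comm_rank(comm, &rank);
>>         MPI_Comm_size(comm, &size);
>>
>>         std::vector<float> all;
>>         if (rank == root) all.resize((size_t)size * count);
>>         MPI_Gather(in, count, MPI_FLOAT,
>>                    all.data(), count, MPI_FLOAT, root, comm);
>>
>>         if (rank == root) {
>>             for (int i = 0; i < count; i++) out[i] = 0.0f;
>>             for (int r = 0; r < size; r++)
>>                 for (int i = 0; i < count; i++)
>>                     out[i] += all[(size_t)r * count + i];
>>         }
>>     }
>>
>>     Comparing this sum with the MPI_Reduce result across runs would show
>>     whether the reduce order, rather than the inputs, is what changes.)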
>>
>>     Thank you for your advice,
>>
>>     Rutger Hofman
>>     VU Amsterdam DAS5 http://www.cs.vu.nl/das5
>>     _______________________________________________
>>     mvapich-discuss mailing list
>>     mvapich-discuss at cse.ohio-state.edu
>>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>


