[mvapich-discuss] MPI_Reduce(MPI_SUM) order
Rutger Hofman
rutger at cs.vu.nl
Mon Dec 7 10:35:16 EST 2015
Please find attached main.cc. I run it with mpirun_rsh -ssh
MV2_ENABLE_AFFINITY=0; I don't know if that makes any difference -- it
shouldn't for one single-threaded process per machine.
The float values are imported as hex bit patterns that I dumped from my
application, so as to exclude any ASCII <-> float conversion issues. The
bit patterns are then cast back to floats.
The application takes 1 parameter: the number of iterations.
Rutger
On 12/07/2015 04:11 PM, Hari Subramoni wrote:
> Hi,
>
> I tried a similar program locally and was not able to see the issue you
> mentioned. We did not see any validation errors. Could you please share
> your reproducer with us so that we can try that out also?
>
> Thx,
> Hari,
>
> On Mon, Dec 7, 2015 at 4:30 AM, Rutger Hofman <rutger at cs.vu.nl
> <mailto:rutger at cs.vu.nl>> wrote:
>
> Update: when run on 3 machines, iteration 2340 (counting starts at
> 0) gives a different result. On 4 machines, iteration 16 gives a
> different result, the same as 5 machines. On 2 machines, I ran
> 1000000 iterations without error.
>
> Rutger Hofman
> VU Amsterdam
> http://www.cs.vu.nl/das5
>
>
> On 12/07/2015 09:55 AM, Rutger Hofman wrote:
>
>     I wrote a little stand-alone C++ program to try and narrow the issue.
>     It performs a tight loop of MPI_Reduce(float[K], ... MPI_SUM, ...)
>     with K=64, invoked with identical parameters of well-formed floats.
>     It compares the result of later iterations with the first result.
>     Since MPI_Reduce is deterministic in its spanning tree, these should
>     be bit-identical.
>
>     The program ran as an application on 5 machines, one thread/process
>     per machine. After 16 iterations, the result differs from the first
>     result. Similar to my issue reported below, the difference is in
>     approximately the 7th significant digit in quite a number of array
>     fields -- this might even count as 'floating-point correct', but I
>     suspect it is an artifact rather than a feature.
>
>     Conclusion: MPI_Reduce is /not/ deterministic, even within one run.
>     Since you explain that it should be deterministic, my guess is that
>     some internal MVapich2 state gets corrupted (and I don't see a reason
>     to primarily suspect the spanning tree).
>
>     Should I post my code for ease of debugging? Are there other things
>     that I can do?
>
> Rutger
>
> On 12/04/2015 11:58 PM, Hari Subramoni wrote:
>
> Hello,
>
>         If the same configuration is used to run the MPI job, then
>         MVAPICH2 retains the order of operations and hence no
>         non-determinism exists. This makes reduction operations bitwise
>         reproducible.
>
>         However, the same guarantees can't be made if the job is first
>         run on 2 nodes with 4 processes per node and then on 4 nodes with
>         2 processes per node. Further, no guarantees can be made across
>         multiple sets of machines due to the inherent non-determinism
>         related to the results of floating point operations at very high
>         precision.
>
> Hope this helps.
>
> Regards,
> Hari.
>
>         On Fri, Dec 4, 2015 at 4:11 AM, Rutger Hofman <rutger at cs.vu.nl
>         <mailto:rutger at cs.vu.nl>> wrote:
>
> Good morning,
>
>             my application uses MVapich2 (locally labeled
>             mvapich2/gcc/64/2.0b) over Infiniband in a CentOS cluster. I
>             notice the following. When I repeatedly run the application,
>             the result of an MPI_Reduce(..., MPI_FLOAT, ..., MPI_SUM)
>             over an array of floats may differ between runs, although
>             the inputs are exactly the same (I checked the bit patterns
>             of the floats), the number of machines is the same, etc. The
>             actual machines allocated, and the connection to the
>             switches, may differ between runs -- I didn't try to pin the
>             machine allocation within the cluster. The difference in the
>             reduce results is at most small, of the order of magnitude
>             one would expect if the summation is carried out in a
>             different order.
>
>             My question: is it possible with MVapich2 that the internal
>             order of the reduce operations is different, even if the
>             number of machines is equal? Is it easy/possible to enforce
>             a fixed order in the reduce implementation, just to verify
>             this? Or should I suspect some bug of my own, like some
>             weird memory corruption? My application also uses RDMA verbs
>             natively; in principle that should work fine.
>
> Thank you for your advice,
>
> Rutger Hofman
> VU Amsterdam DAS5 http://www.cs.vu.nl/das5
> _______________________________________________
> mvapich-discuss mailing list
>             mvapich-discuss at cse.ohio-state.edu
>             <mailto:mvapich-discuss at cse.ohio-state.edu>
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.cc
Type: text/x-c++src
Size: 6294 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20151207/919d0d7b/attachment.bin>