[mvapich-discuss] MPI_Reduce(MPI_SUM) order

Rutger Hofman rutger at cs.vu.nl
Mon Dec 7 10:35:16 EST 2015


Please find attached main.cc. I run it with mpirun_rsh -ssh and
MV2_ENABLE_AFFINITY=0; I don't know whether that makes any difference --
it shouldn't for one single-threaded process per machine.

The float values have been imported as hex bit patterns that I dumped
from my application, so as to exclude any ASCII <-> float conversion
issues. The bit patterns are then cast back to floats.
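The cast is done along these lines (a simplified sketch; the helper name
is mine, not necessarily the one used in main.cc):

    #include <cstdint>
    #include <cstring>

    // Reinterpret a dumped 32-bit hex pattern as the original float,
    // bit-exact, without going through any textual representation.
    static float bits_to_float(uint32_t bits) {
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }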

The application takes 1 parameter: the number of iterations.
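
For reference, the core of the program is roughly the following (a
simplified sketch; main.cc contains the actual dumped bit patterns and
the mismatch reporting):

    #include <mpi.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int K = 64;
        const long iterations = (argc > 1) ? std::atol(argv[1]) : 1;

        // main.cc fills in[] from the dumped bit patterns; any fixed,
        // well-formed values reproduce the structure of the test.
        float in[K], out[K], first[K];
        for (int i = 0; i < K; i++)
            in[i] = 1.0f / (float)(i + 1);

        for (long it = 0; it < iterations; it++) {
            MPI_Reduce(in, out, K, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0) {
                if (it == 0)
                    std::memcpy(first, out, sizeof out);
                else if (std::memcmp(first, out, sizeof out) != 0)
                    std::printf("iteration %ld differs from iteration 0\n", it);
            }
        }

        MPI_Finalize();
        return 0;
    }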

Rutger

On 12/07/2015 04:11 PM, Hari Subramoni wrote:
> Hi,
>
> I tried a similar program locally and was not able to see the issue you
> mentioned. We did not see any validation errors. Could you please share
> your reproducer with us so that we can try that out also?
>
> Thx,
> Hari,
>
>     On Mon, Dec 7, 2015 at 4:30 AM, Rutger Hofman <rutger at cs.vu.nl> wrote:
>
>     Update: when run on 3 machines, iteration 2340 (counting starts at
>     0) gives a different result. On 4 machines, iteration 16 gives a
>     different result, the same as with 5 machines. On 2 machines, I ran
>     1000000 iterations without error.
>
>     Rutger Hofman
>     VU Amsterdam
>     http://www.cs.vu.nl/das5
>
>
>     On 12/07/2015 09:55 AM, Rutger Hofman wrote:
>
>         I wrote a little stand-alone C++ program to try and narrow down
>         the issue. It performs a tight loop of MPI_Reduce(float[K], ...
>         MPI_SUM, ...) with K=64, invoked with identical parameters of
>         well-formed floats. It compares the result of later iterations
>         with the first result. Since MPI_Reduce is deterministic in its
>         spanning tree, these should be bit-identical.
>
>         The program ran as an application on 5 machines, one
>         thread/process per machine. After 16 iterations, the result
>         differs from the first result. Similar to my issue reported
>         below, the difference appears around the 7th significant digit
>         in quite a number of array fields -- that might even count as
>         'floating-point correct', but I suspect it is an artifact rather
>         than a feature.
>
>         Conclusion: MPI_Reduce is /not/ deterministic, even within one
>         run. Since you explain that it should be deterministic, my guess
>         is that some internal MVAPICH2 state gets corrupted (and I see
>         no reason to primarily suspect the spanning tree).
>
>         Should I post my code for ease of debugging? Are there other
>         things I can do?
>
>         Rutger
>
>         On 12/04/2015 11:58 PM, Hari Subramoni wrote:
>
>             Hello,
>
>             If the configuration chosen to run the MPI job is kept the
>             same, then MVAPICH2 retains the order of operations, so no
>             non-determinism exists. This makes reduction operations
>             bitwise reproducible.
>
>             However, the same guarantee cannot be made if the job is
>             first run as 2 nodes with 4 processes per node and then as
>             4 nodes with 2 processes per node. Further, no guarantees
>             can be made across multiple sets of machines, due to the
>             inherent sensitivity of floating-point results to the order
>             of operations at very high precision.
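>
>             (As an illustration -- this is not MVAPICH2 code, just a
>             minimal C++ sketch -- single-precision addition is not
>             associative, so combining the same summands in a different
>             order can give a different result:
>
>                 #include <cstdio>
>
>                 int main() {
>                     float big = 16777216.0f;       // 2^24
>                     float a = (big + 1.0f) - big;  // the 1.0f is rounded away: a == 0
>                     float b = (1.0f - big) + big;  // every step is exact: b == 1
>                     std::printf("%g %g\n", a, b);  // prints "0 1"
>                     return 0;
>                 }
>
>             Spread over many summands of comparable magnitude, the same
>             effect shows up as small differences in the final digits.)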
>
>             Hope this helps.
>
>             Regards,
>             Hari.
>
>             On Fri, Dec 4, 2015 at 4:11 AM, Rutger Hofman
>             <rutger at cs.vu.nl> wrote:
>
>                  Good morning,
>
>                  My application uses MVAPICH2 (locally labeled
>                  mvapich2/gcc/64/2.0b) over InfiniBand in a CentOS
>                  cluster. I notice the following. When I repeatedly run
>                  the application, the result of an MPI_Reduce(...,
>                  MPI_FLOAT, ..., MPI_SUM) over an array of floats may
>                  be different over various runs, although the inputs
>                  are exactly the same (I checked the bit patterns of
>                  the floats), the number of machines is the same, etc.
>                  The actual machines allocated, and the connection to
>                  the switches, may differ between runs -- I didn't try
>                  to pin the machine allocation within the cluster. The
>                  difference in the reduce results is at most small, of
>                  the order of magnitude one would expect if the
>                  summation is carried out in a different order.
>
>                  My question: is it possible with MVAPICH2 that the
>                  internal order of the reduce operations is different,
>                  even if the number of machines is equal? Is it
>                  easy/possible to enforce a fixed order in the reduce
>                  implementation, just to verify this? Or should I
>                  suspect some bug of my own, like some weird memory
>                  corruption? My application also uses RDMA verbs
>                  natively; in principle that should work fine.
>
>                  Thank you for your advice,
>
>                  Rutger Hofman
>                  VU Amsterdam DAS5 http://www.cs.vu.nl/das5
>                  _______________________________________________
>                  mvapich-discuss mailing list
>                  mvapich-discuss at cse.ohio-state.edu
>                  http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.cc
Type: text/x-c++src
Size: 6294 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20151207/919d0d7b/attachment.bin>

