[mvapich-discuss] MPI_Reduce(MPI_SUM) order
Rutger Hofman
rutger at cs.vu.nl
Fri Dec 4 04:11:57 EST 2015
Good morning,
my application uses MVAPICH2 (locally labeled mvapich2/gcc/64/2.0b) over
InfiniBand in a CentOS cluster. I notice the following: when I
repeatedly run the application, the result of an MPI_Reduce(...,
MPI_FLOAT, ..., MPI_SUM) over an array of floats may differ between
runs, even though the inputs are exactly the same (I checked the
bit patterns of the floats), the number of machines is the same, and
so on. The actual machines allocated, and their connections to the
switches, may differ between runs -- I didn't try to pin the machine
allocation within the cluster. The differences in the reduce results
are small, of the order of magnitude one would expect if the summation
were carried out in a different order.
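
To illustrate what I mean, here is a tiny self-contained example (the
values are made up for illustration) showing that float addition is
not associative, so a reordered reduction can legitimately change the
result:

#include <stdio.h>

/* Summing the same floats in two different orders can yield
 * different results -- the effect a reordered MPI_Reduce would
 * produce. */
int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   /* = 1.0f */
    float right = a + (b + c);   /* = 0.0f: c is absorbed into b,
                                    since 1.0f is below float
                                    precision at magnitude 1e8 */
    printf("left=%g right=%g\n", left, right);
    return 0;
}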
My question: is it possible with MVAPICH2 that the internal order of the
reduce operations differs between runs, even if the number of machines
is the same? Is it easy/possible to enforce a fixed order in the reduce
implementation, just to verify this? Or should I suspect a bug of my
own, like some weird memory corruption? My application also uses RDMA
verbs natively; in principle that should work fine.
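
Independent of the MVAPICH2 internals, one way I could verify this
myself would be to replace the reduce by a gather plus a sequential sum
in fixed rank order -- a rough sketch (the array length N and the input
data are made up for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Order-fixed alternative to MPI_Reduce(MPI_SUM), for verification
 * only: gather every rank's array to the root and sum in ascending
 * rank order, so the summation order is identical in every run. */
#define N 4

int main(int argc, char **argv)
{
    int rank, size;
    float local[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++)          /* dummy input data */
        local[i] = (float)(rank + 1) * 0.1f * (float)(i + 1);

    float *all = NULL;
    if (rank == 0)
        all = malloc((size_t)size * N * sizeof(float));

    /* MPI_Gather places rank r's data at offset r*N, so the
     * summation order below is deterministic. */
    MPI_Gather(local, N, MPI_FLOAT, all, N, MPI_FLOAT, 0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        float sum[N] = { 0.0f };
        for (int r = 0; r < size; r++)   /* fixed summation order */
            for (int i = 0; i < N; i++)
                sum[i] += all[r * N + i];
        for (int i = 0; i < N; i++)
            printf("sum[%d] = %.9g\n", i, sum[i]);
        free(all);
    }

    MPI_Finalize();
    return 0;
}

If this always gives bit-identical results across runs while the
MPI_Reduce version does not, that would point at reduction order rather
than memory corruption on my side.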
Thank you for your advice,
Rutger Hofman
VU Amsterdam DAS5 http://www.cs.vu.nl/das5