[mvapich-discuss] Bug in Allreduce for user-defined ops

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Oct 8 16:29:44 EDT 2008


An update for list users: this issue has been resolved, and Jack has tested
the solution. Patches #3060 and #3061 have been added to the mvapich 1.0.1
branch and the mvapich trunk (for 1.1), respectively.

Jack - Thanks for your help in testing this solution.

We are closing this report.

Thanks,

DK

On Sun, 28 Sep 2008, Jack Poulson wrote:

> I believe I've run into a bug in the implementation of Allreduce for
> user-defined functions in MVAPICH 1.0 and 1.0.1 (0.9.8 works).
>
> In 0.9.8, for power-of-two processes, the user-op is called log2 times
> with the correct length. In the new versions, it appears to be called
> log2+2 times, where the first call to the user-op passes in a count of
> zero (I found this by simply printing it from within the user-op).
> I've looked through the intra_Allreduce routine in
> src/coll/intra_fns_new.c, but I don't see why the user-op is called
> more than log2 times for power-of-two processes.
>
> Should user-defined ops check to ensure the length is nonzero? I've
> attached a driver and output that demonstrate the problem. The issue
> causes problems in operations such as a custom pivoting operation in
> an LU factorization, where an integer is tacked onto the end of a set
> of doubles, and a zero length in bytes would cause the routine to
> conclude that a negative number of doubles is being operated on. I've been working
> around the problem with a custom Allreduce implementation that uses a
> reduce-to-one/bcast, but I would like to take advantage of your team's
> multicore optimizations.
>
> Thank you,
> Jack Poulson
>
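
For readers wondering what the length check asked about in the quoted
message might look like, here is a minimal sketch. It is not code from
MVAPICH or from Jack's attached driver. The buffer layout (some doubles
followed by one int, sent as raw bytes) follows the description above;
the name pivot_combine and the "larger leading magnitude wins" rule are
assumptions made only for illustration. The core routine is kept free of
mpi.h so it can be compiled and tested on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical pivot-combining routine (illustrative only).
 * Assumed layout of each buffer: n doubles followed by one int,
 * with len_bytes = n * sizeof(double) + sizeof(int).
 * Assumed rule: the buffer whose first double has the larger
 * magnitude wins and is copied into inout.
 */
void pivot_combine(const void *in, void *inout, int len_bytes)
{
    /* Guard: a zero (or too short) length would otherwise make the
       payload size computed below wrap around or go negative. */
    if (len_bytes < (int)sizeof(int))
        return;

    size_t payload = (size_t)len_bytes - sizeof(int);
    if (payload == 0 || payload % sizeof(double) != 0)
        return;

    assert(in != NULL && inout != NULL);

    /* memcpy avoids alignment assumptions on the byte buffers. */
    double a, b;
    memcpy(&a, in, sizeof a);
    memcpy(&b, inout, sizeof b);

    double abs_a = a < 0 ? -a : a;
    double abs_b = b < 0 ? -b : b;

    if (abs_a > abs_b)
        memcpy(inout, in, (size_t)len_bytes);
}

/*
 * With MPI available, a wrapper matching MPI_User_function could
 * forward to the routine above (assuming the data is sent as MPI_BYTE,
 * so that *len is a byte count):
 *
 *   void pivot_op(void *in, void *inout, int *len, MPI_Datatype *dt)
 *   {
 *       pivot_combine(in, inout, *len);
 *   }
 */
```

With a guard like this, a call with a count of zero leaves the inout
buffer untouched, so the same op behaves identically whether or not the
library issues such a call.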


