[mvapich-discuss] Issue with mpi_alltoall on 64 nodes or more

Rick Warner rick at microway.com
Wed Apr 26 16:47:51 EDT 2006


We've narrowed down the MPI_Alltoall() problem a bit.  The customer
test code that exposed the problem sends a message of 128MB/N^2 bytes
from each process to each other process, where N is the process
count.  For N == 64, the message size comes out to exactly 32768
bytes; for N < 64, it is larger than 32768.
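
As a quick sanity check of that arithmetic (assuming "128MB" here
means 128 * 2^20 bytes, which is what makes the numbers work out):

#include <stdio.h>

int main(void)
{
    long total = 128L * 1024 * 1024;   /* 128MB, split across N^2 messages */
    for (int n = 32; n <= 128; n *= 2)
        printf("N = %3d -> %ld bytes per message\n", n, total / ((long)n * n));
    return 0;
}

This prints 131072 for N = 32, 32768 for N = 64, and 8192 for N = 128,
so the per-message size hits 32768 exactly at N == 64.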

Looking at .../src/coll/intra_fns_new.c, it seems that MPI_Alltoall()
chooses among four different algorithms depending on the message size
and the number of processes.  In our case, for N < 64, it uses
algorithm 3 or 4 (depending on whether N is a power of 2), the
large-message algorithm built on MPI_Sendrecv().  For N >= 64, it
uses algorithm 2, the medium-message algorithm, which posts
nonblocking MPI_Isend() and MPI_Irecv() operations to and from all
other processes and then calls MPI_Waitall().  In other contexts
we've seen apparent starvation when that many nonblocking sends and
receives are outstanding at once.
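
Roughly, the selection logic reads like this (our paraphrase, not the
literal source; the constant names match the file, but the
MPIR_ALLTOALL_SHORT_MSG value and the exact branch conditions are our
reconstruction):

#include <stdio.h>

#define MPIR_ALLTOALL_SHORT_MSG   256   /* assumed value; not verified */
#define MPIR_ALLTOALL_MEDIUM_MSG 32768  /* the value at line 35 */

static const char *pick_alltoall_algo(int nbytes, int nprocs)
{
    int pof2 = (nprocs & (nprocs - 1)) == 0;

    if (nbytes <= MPIR_ALLTOALL_SHORT_MSG && nprocs >= 8)
        return "1: short-message algorithm";
    if (nbytes <= MPIR_ALLTOALL_MEDIUM_MSG)
        return "2: MPI_Isend()/MPI_Irecv() to all ranks, then MPI_Waitall()";
    if (pof2)
        return "3: pairwise MPI_Sendrecv(), XOR pairing";
    return "4: pairwise MPI_Sendrecv(), rank rotation";
}

int main(void)
{
    /* 128MB/63^2 is about 33816 bytes; 128MB/64^2 is exactly 32768. */
    printf("N=63: algorithm %s\n", pick_alltoall_algo(33816, 63));
    printf("N=64: algorithm %s\n", pick_alltoall_algo(32768, 64));
    return 0;
}

Feeding it our numbers, N = 63 lands on the large-message path while
N = 64 drops into algorithm 2, which matches the behavior we see.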

We're thinking of a workaround that bypasses algorithm 2 for medium
messages (we could leave it in place for the other case where it is
used, namely short messages with fewer than 8 processes).  To do so,
we would change the definition of MPIR_ALLTOALL_MEDIUM_MSG at line 35
of intra_fns_new.c from 32768 to:

#define MPIR_ALLTOALL_MEDIUM_MSG 256

How does this sound?  We're thinking that using the large-message
algorithm for medium-sized messages shouldn't hurt much, and it may
avoid the problems we've been seeing.
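
For reference, the pairwise exchange that algorithms 3 and 4 perform
looks roughly like this (again a paraphrase for illustration, not the
MVAPICH source; pairwise_alltoall is our own name):

#include <mpi.h>

/* Each step pairs every rank with exactly one partner, so each process
 * has at most one send and one receive in flight at a time. */
static void pairwise_alltoall(char *sendbuf, char *recvbuf,
                              int msgbytes, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Status status;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    int pof2 = (nprocs & (nprocs - 1)) == 0;

    for (int i = 0; i < nprocs; i++) {
        int src, dst;
        if (pof2) {
            src = dst = rank ^ i;              /* algorithm 3: XOR pairing */
        } else {
            dst = (rank + i) % nprocs;         /* algorithm 4: rotate partners */
            src = (rank - i + nprocs) % nprocs;
        }
        MPI_Sendrecv(sendbuf + (long)dst * msgbytes, msgbytes, MPI_CHAR, dst, 0,
                     recvbuf + (long)src * msgbytes, msgbytes, MPI_CHAR, src, 0,
                     comm, &status);
    }
}

Because only one exchange per process is outstanding at each step,
this pattern should sidestep the kind of starvation we've seen when
N-1 nonblocking sends and receives are all posted at once.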

On Wednesday 26 April 2006 14:27, Rick Warner wrote:
> It gave the same slow behavior with the DISABLE_RDMA_ALLTOALL=1 addition.
> We have also tried splitting the machine list so that 16 systems from each
> leaf switch are used.  With that configuration it seems to run properly
> about 90% of the time, only occasionally taking multiple seconds to
> complete.
>
> On Wednesday 26 April 2006 01:48, Sayantan Sur wrote:
> > Hello Rick,
> >
> > * On Apr 1, Rick Warner <rick at microway.com> wrote:
> > > Hello all,
> > >  We are experiencing a problem on a medium-sized InfiniBand cluster
> > > (89 nodes).  mpi_alltoall on 64 or more nodes takes an excessively
> > > long time.  On 63 nodes it completes in a fraction of a second; on
> > > 64, it takes about 20 seconds.
> >
> > Thanks for your report to the group. Could you please try to use the
> > Alltoall program like this:
> >
> > $ mpirun_rsh -np 64 -hostfile mf DISABLE_RDMA_ALLTOALL=1 ./a.out
> >
> > If you could report the result of this back, it will help us in
> > narrowing down the problem.
> >
> > Thanks,
> > Sayantan.

-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
