[mvapich-discuss] Issue with mpi_alltoall on 64 nodes or more

Sayantan Sur surs at cse.ohio-state.edu
Thu Apr 27 02:09:10 EDT 2006


Hello Rick,

Thank you for your detailed analysis of the problem. The issue appears
to stem from the communication pattern; in particular, the network does
not seem to be able to withstand a large number of simultaneous
non-blocking all-to-all transfers. We have not seen this type of
behavior on the InfiniBand clusters we have worked with.

The workaround you suggest, changing the all-to-all message-size
threshold, will work for this benchmark program. However, other MPI
application codes may generate the same kind of communication pattern
on their own, which is difficult to address from inside the MPI
library :-) Besides, the non-blocking algorithm used in MPICH and the
RDMA version we provide (intra_rdma_alltoall.c) have proven to be
optimal on most IB clusters, so switching to the other algorithm may
come with a performance hit.
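
For readers following the thread, the MPI_Sendrecv()-based pairwise
exchange that the workaround would force for medium-sized messages
looks roughly like the sketch below. This is a simplified illustration,
assuming a power-of-two number of processes and contiguous MPI_BYTE
buffers; it is not the actual code in intra_fns_new.c.

#include <mpi.h>
#include <string.h>

/* Pairwise-exchange alltoall sketch: in each step every rank is paired
 * with exactly one partner, so each process has at most one send and
 * one receive in flight at a time.  Assumes the communicator size is a
 * power of two so that the XOR pairing meets every partner exactly
 * once. */
static int pairwise_alltoall(char *sendbuf, char *recvbuf,
                             int msgsize, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Local copy for the block each rank "sends" to itself. */
    memcpy(recvbuf + (size_t)rank * msgsize,
           sendbuf + (size_t)rank * msgsize, msgsize);

    for (int step = 1; step < size; step++) {
        int partner = rank ^ step;
        MPI_Sendrecv(sendbuf + (size_t)partner * msgsize, msgsize,
                     MPI_BYTE, partner, 0,
                     recvbuf + (size_t)partner * msgsize, msgsize,
                     MPI_BYTE, partner, 0, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}

Because each step pairs every process with a single partner, the
traffic is paced rather than scattered across the whole fabric at once,
which is presumably also why it can be slower for smaller messages, as
noted above.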

Once again, thanks for your analysis and for sharing it with the group.

Thanks,
Sayantan.

* On Apr 4, Rick Warner <rick at microway.com> wrote:
> We've narrowed down the MPI_Alltoall() problem a bit.  The customer
> test code that showed the problem sends 128 MB / N^2 bytes from each
> process to every other process, where N is the process count.  For
> N == 64 the message size comes out to 32768 bytes; for N < 64 it is
> larger than 32768 bytes.
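> 
> Spelling out the arithmetic for N == 64:
> (128 * 2^20 bytes) / 64^2 = 134217728 / 4096 = 32768 bytes per message.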
> 
> Looking at .../src/coll/intra_fns_new.c, it seems that MPI_Alltoall()
> uses one of four different algorithms, depending on the message size
> and the number of processes. In our case, for N < 64 it uses algorithm
> 3 or 4 (depending on whether N is a power of 2), the large-message
> algorithms based on MPI_Sendrecv(). For N >= 64 it uses algorithm 2,
> the medium-message algorithm, which posts nonblocking MPI_Isend() and
> MPI_Irecv() operations to and from all other processes and then calls
> MPI_Waitall(). In other contexts we've seen apparent starvation issues
> with large numbers of simultaneous nonblocking sends and receives.
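> 
> In sketch form (simplified, with contiguous MPI_BYTE buffers, and not
> the actual intra_fns_new.c code), that medium-message pattern is
> roughly:
> 
> #include <mpi.h>
> #include <stdlib.h>
> #include <string.h>
> 
> /* Post a nonblocking receive from and a nonblocking send to every
>  * other rank, then wait for all of them at once.  With N processes
>  * this leaves 2*(N-1) operations outstanding per process at the same
>  * time, which is the kind of load where we've seen starvation. */
> static int scattered_alltoall(char *sendbuf, char *recvbuf,
>                               int msgsize, MPI_Comm comm)
> {
>     int rank, size;
>     MPI_Comm_rank(comm, &rank);
>     MPI_Comm_size(comm, &size);
> 
>     MPI_Request *reqs = malloc(2 * (size - 1) * sizeof(MPI_Request));
>     int nreq = 0;
> 
>     /* Local copy for the block each rank "sends" to itself. */
>     memcpy(recvbuf + (size_t)rank * msgsize,
>            sendbuf + (size_t)rank * msgsize, msgsize);
> 
>     for (int i = 0; i < size; i++)
>         if (i != rank)
>             MPI_Irecv(recvbuf + (size_t)i * msgsize, msgsize, MPI_BYTE,
>                       i, 0, comm, &reqs[nreq++]);
>     for (int i = 0; i < size; i++)
>         if (i != rank)
>             MPI_Isend(sendbuf + (size_t)i * msgsize, msgsize, MPI_BYTE,
>                       i, 0, comm, &reqs[nreq++]);
> 
>     MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
>     free(reqs);
>     return MPI_SUCCESS;
> }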
> 
> We're thinking of a workaround that bypasses algorithm 2 for medium
> messages (I suppose it would still be used in the other case, namely
> short messages with fewer than 8 processes). To do so, we would change
> the definition of MPIR_ALLTOALL_MEDIUM_MSG on line 35 of
> intra_fns_new.c from 32768 to:
> 
> #define MPIR_ALLTOALL_MEDIUM_MSG 256
> 
> How does this sound? We're thinking that using the large message
> algorithm for medium-sized messages shouldn't hurt too much, and
> may avoid the problems we've been seeing.
> 
> On Wednesday 26 April 2006 14:27, Rick Warner wrote:
> > It still showed the same slow behavior with DISABLE_RDMA_ALLTOALL=1 added.
> > We have also tried splitting the machine list so that 16 systems from each
> > leaf switch are used.  With that configuration it seems to run properly
> > about 90% of the time, only occasionally taking multiple seconds to
> > complete.
> >
> > On Wednesday 26 April 2006 01:48, Sayantan Sur wrote:
> > > Hello Rick,
> > >
> > > * On Apr 1, Rick Warner <rick at microway.com> wrote:
> > > > Hello all,
> > > >  We are experiencing a problem on a medium sized infiniband cluster (89
> > > > nodes).  mpi_alltoall on 64 or more nodes takes an excessively long
> > > > time. On 63 nodes, it completes in a fraction of a second.  On 64, it
> > > > takes about 20 seconds.
> > >
> > > Thanks for your report to the group. Could you please try to use the
> > > Alltoall program like this:
> > >
> > > $ mpirun_rsh -np 64 -hostfile mf DISABLE_RDMA_ALLTOALL=1 ./a.out
> > >
> > > If you could report the result back to us, it would help in
> > > narrowing down the problem.
> > >
> > > Thanks,
> > > Sayantan.
> 
> -- 
> Richard Warner
> Lead Systems Integrator
> Microway, Inc
> (508)732-5517

-- 
http://www.cse.ohio-state.edu/~surs

