[mvapich-discuss] Issue with mpi_alltoall on 64 nodes or more

Pavel Shamis (Pasha) pasha at mellanox.co.il
Thu Apr 27 02:42:41 EDT 2006


Hi,
In the IBGD2 release (mvapich-gen2), the Alltoall medium-size flow was
disabled by default (it is actually a run-time parameter).

Sayantan,
I'm not sure, but I think the patch was sent to you. Anyway, please let
me know if you need it.

Regards,
Pasha (Pavel Shamis)

Rick Warner wrote:
> We've narrowed down the MPI_Alltoall() problem a bit.  The customer
> test code that showed the problem with MPI_Alltoall() sends 128MB/N^2
> bytes of data from each process, where N is the process count.  For
> N == 64, the message size comes out to exactly 32768 bytes; for N < 64,
> the message size is more than 32768.
> 
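> For reference, the arithmetic is easy to check with a small standalone
> program (this is just the 128MB/N^2 computation from above; nothing
> here comes from the MVAPICH source):
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       /* 128MB of data split into N^2 equal messages */
>       const long total = 128L * 1024 * 1024;
>       for (int n = 62; n <= 65; n++)
>           printf("N = %d -> message size = %ld bytes\n",
>                  n, total / ((long) n * n));
>       return 0;
>   }
>
> For N == 64 this prints 32768 bytes, sitting exactly on the medium
> message threshold; for N == 63 it prints 33817, just above it.
>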
> Looking at .../src/coll/intra_fns_new.c, it seems that MPI_Alltoall()
> uses one of four different algorithms, depending on the message size
> and the number of processes. In our case, for N < 64, it uses
> algorithm 3 or 4 (depending on whether N is a power of 2), the
> large-message algorithms based on MPI_Sendrecv(). For N >= 64, it uses
> algorithm 2, the medium-message algorithm, which posts nonblocking
> MPI_Isend() and MPI_Irecv() operations to and from all other
> processes, followed by a single MPI_Waitall(). In other contexts we've
> seen apparent starvation issues with a large number of simultaneous
> nonblocking sends and receives.
> 
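> A minimal sketch of the pattern algorithm 2 sets up (a paraphrase of
> the structure described above, not the actual code in
> intra_fns_new.c; the function and variable names are illustrative):
>
>   #include <stdlib.h>
>   #include <mpi.h>
>
>   /* Post a nonblocking receive and send for every rank, then wait on
>    * all of them at once.  With N = 64 this leaves 128 outstanding
>    * requests per process, which is where starvation can show up. */
>   static void alltoall_medium_sketch(char *sendbuf, char *recvbuf,
>                                      int chunk, MPI_Comm comm)
>   {
>       int nprocs, r = 0;
>       MPI_Comm_size(comm, &nprocs);
>       MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));
>
>       for (int i = 0; i < nprocs; i++)
>           MPI_Irecv(recvbuf + (size_t) i * chunk, chunk, MPI_BYTE,
>                     i, 0, comm, &reqs[r++]);
>       for (int i = 0; i < nprocs; i++)
>           MPI_Isend(sendbuf + (size_t) i * chunk, chunk, MPI_BYTE,
>                     i, 0, comm, &reqs[r++]);
>
>       MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);
>       free(reqs);
>   }
>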
> We're thinking of a workaround that bypasses algorithm 2 for medium
> messages (we could still leave it enabled for the other case it
> covers, namely short messages with fewer than 8 processes). To do so,
> we would change the definition of MPIR_ALLTOALL_MEDIUM_MSG in line 35
> of intra_fns_new.c from 32768 to:
> 
> #define MPIR_ALLTOALL_MEDIUM_MSG 256
> 
> How does this sound? We're thinking that using the large message
> algorithm for medium-sized messages shouldn't hurt too much, and
> may avoid the problems we've been seeing.
> 
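> For context, the selection logic in intra_fns_new.c has roughly this
> shape (a paraphrase rather than the verbatim source; the short-message
> constant name is our assumption):
>
>   /* nbytes: bytes per message; nprocs: communicator size */
>   int nbytes = sendtype_size * sendcount;
>
>   if (nbytes <= MPIR_ALLTOALL_SHORT_MSG && nprocs >= 8) {
>       /* algorithm 1: short messages, 8 or more processes */
>   } else if (nbytes <= MPIR_ALLTOALL_MEDIUM_MSG) {
>       /* algorithm 2: Isend/Irecv to all ranks + Waitall */
>   } else {
>       /* algorithms 3/4: pairwise MPI_Sendrecv exchange */
>   }
>
> With the threshold lowered to 256, anything larger than 256 bytes
> falls through to the pairwise branch, while short messages on fewer
> than 8 processes still take algorithm 2.
>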
> On Wednesday 26 April 2006 14:27, Rick Warner wrote:
>> It gave the same slow behavior with the DISABLE_RDMA_ALLTOALL=1 addition.
>> We have also tried splitting the machine list so that 16 systems from
>> each leaf switch are used.  With that configuration, it seems to run
>> properly about 90% of the time, only occasionally taking multiple
>> seconds to complete.
>>
>> On Wednesday 26 April 2006 01:48, Sayantan Sur wrote:
>>> Hello Rick,
>>>
>>> * On Apr 1, Rick Warner <rick at microway.com> wrote:
>>>> Hello all,
>>>>  We are experiencing a problem on a medium-sized InfiniBand cluster
>>>> (89 nodes).  mpi_alltoall on 64 or more nodes takes an excessively
>>>> long time: on 63 nodes it completes in a fraction of a second, while
>>>> on 64 it takes about 20 seconds.
>>> Thanks for your report to the group. Could you please try running the
>>> Alltoall program like this:
>>>
>>> $ mpirun_rsh -np 64 -hostfile mf DISABLE_RDMA_ALLTOALL=1 ./a.out
>>>
>>> If you could report the result back to us, it would help us narrow
>>> down the problem.
>>>
>>> Thanks,
>>> Sayantan.
> 


