[mvapich-discuss] Issue with mpi_alltoall on 64 nodes or more

Sayantan Sur surs at cse.ohio-state.edu
Fri Apr 28 04:07:26 EDT 2006


Hi Pasha,

Pavel Shamis (Pasha) wrote:

> Hi,
> In the IBGD 2 release (mvapich-gen2), the Alltoall medium-size flow was 
> disabled by default (actually, it is a run-time parameter).
>
> Sayantan,
> I'm not sure, but I think the patch was sent to you. Anyway, please let 
> me know if you need it.

Thanks for your email to the MVAPICH community. Yes, we received the 
patch; however, upon review it was apparent that it is just a 
workaround and not a real solution. As I said in the earlier email, 
there is nothing to stop an MPI application from issuing the same 
communication pattern itself with point-to-point calls. Also, disabling 
algorithms that have been proven to give the best performance on most 
clusters is not a real solution in any case. I think a 64-node cluster 
_should_ be able to handle such a communication pattern if the network 
is behaving properly. It is clear that the network is unable to handle 
an intense communication pattern.

Lastly, if cluster users (or vendors) are willing to disable the 
optimized algorithm to get around the network issues, then they are free 
to follow Rick's suggestion (i.e., disable the algorithm). However, 
they should be aware that it is only a workaround.

Thanks,
Sayantan.

>
> Regards,
> Pasha (Pavel Shamis)
>
> Rick Warner wrote:
>
>> We've narrowed down the MPI_Alltoall() problem a bit.  The customer
>> test code that showed the problem with MPI_Alltoall() sends 128MB/N^2
>> bytes from each process to every other process, where N is the process
>> count. For N == 64, that per-destination message size comes out to
>> 32768 bytes; for N < 64, it is larger than 32768.
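>>
>> Roughly, the customer's test pattern boils down to the sketch below
>> (simplified, with illustrative buffer handling rather than their
>> actual code):
>>
>> #include <mpi.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int nprocs;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>
>>     /* 128MB / N^2 bytes per destination; 32768 bytes when N == 64 */
>>     int msgsize = (128 * 1024 * 1024) / (nprocs * nprocs);
>>     char *sendbuf = malloc((size_t)msgsize * nprocs);
>>     char *recvbuf = malloc((size_t)msgsize * nprocs);
>>
>>     MPI_Alltoall(sendbuf, msgsize, MPI_BYTE,
>>                  recvbuf, msgsize, MPI_BYTE, MPI_COMM_WORLD);
>>
>>     free(sendbuf);
>>     free(recvbuf);
>>     MPI_Finalize();
>>     return 0;
>> }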
>>
>> Looking at .../src/coll/intra_fns_new.c, it seems that MPI_Alltoall()
>> uses one of four different algorithms, depending on the message size
>> and the number of processes. In our case, for N < 64, it is using
>> algorithm 3 or 4 (depending on whether N is a power of 2), the
>> large-message algorithm based on MPI_Sendrecv(). For N >= 64, it is
>> using algorithm 2, the medium-message algorithm that posts nonblocking
>> MPI_Isend() and MPI_Irecv() operations to and from all other processes,
>> followed by MPI_Waitall(). In other contexts we've seen apparent
>> starvation issues with a large number of simultaneous nonblocking
>> sends and receives.
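>>
>> For reference, algorithm 2 boils down to the pattern below (our
>> paraphrase from reading the code, not the literal MVAPICH source);
>> with N == 64 that is 128 outstanding requests per process at once:
>>
>> /* Post receives from and sends to every peer, then wait on all
>>  * 2*N requests together (assumes <mpi.h> and <stdlib.h>). */
>> static void alltoall_medium(char *sendbuf, char *recvbuf,
>>                             int msgsize, int nprocs)
>> {
>>     MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));
>>     int i;
>>
>>     for (i = 0; i < nprocs; i++)
>>         MPI_Irecv(recvbuf + (size_t)i * msgsize, msgsize, MPI_BYTE,
>>                   i, 0, MPI_COMM_WORLD, &reqs[i]);
>>     for (i = 0; i < nprocs; i++)
>>         MPI_Isend(sendbuf + (size_t)i * msgsize, msgsize, MPI_BYTE,
>>                   i, 0, MPI_COMM_WORLD, &reqs[nprocs + i]);
>>     MPI_Waitall(2 * nprocs, reqs, MPI_STATUSES_IGNORE);
>>     free(reqs);
>> }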
>>
>> We're thinking of a workaround that bypasses algorithm 2 for medium
>> messages (I guess we could leave it in for the other case, namely short
>> messages with fewer than 8 processes). To do so, we would change the
>> definition of MPIR_ALLTOALL_MEDIUM_MSG on line 35 of intra_fns_new.c
>> from 32768 to:
>>
>> #define MPIR_ALLTOALL_MEDIUM_MSG 256
>>
>> How does this sound? We're thinking that using the large-message
>> algorithm for medium-sized messages shouldn't hurt too much, and may
>> avoid the problems we've been seeing.
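>>
>> For context, here is our paraphrase of how the algorithm gets picked
>> (not the literal source; the short-message cutoff name and value are
>> our assumption). Lowering the medium cutoff to 256 pushes everything
>> of 256 bytes or more onto the pairwise MPI_Sendrecv() path:
>>
>> #define MPIR_ALLTOALL_SHORT_MSG    256    /* name/value assumed */
>> #define MPIR_ALLTOALL_MEDIUM_MSG 32768    /* current line 35 value */
>>
>> static int pick_alltoall_algorithm(int msgsize, int nprocs)
>> {
>>     int pof2 = (nprocs & (nprocs - 1)) == 0;    /* N a power of 2? */
>>
>>     if (msgsize <= MPIR_ALLTOALL_SHORT_MSG && nprocs >= 8)
>>         return 1;            /* short-message algorithm */
>>     if (msgsize <= MPIR_ALLTOALL_MEDIUM_MSG)
>>         return 2;            /* Isend/Irecv to all peers + Waitall */
>>     return pof2 ? 3 : 4;     /* pairwise exchange via MPI_Sendrecv() */
>> }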
>>
>> On Wednesday 26 April 2006 14:27, Rick Warner wrote:
>>
>>> It gave the same slow behavior with the DISABLE_RDMA_ALLTOALL=1
>>> addition. Another thing that has been tried is splitting the machine
>>> list so that 16 systems from each leaf switch are used. With that
>>> configuration, it seems to run properly about 90% of the time, only
>>> sometimes taking multiple seconds to complete.
>>>
>>> On Wednesday 26 April 2006 01:48, Sayantan Sur wrote:
>>>
>>>> Hello Rick,
>>>>
>>>> * On Apr 1, Rick Warner <rick at microway.com> wrote:
>>>>
>>>>> Hello all,
>>>>>  We are experiencing a problem on a medium-sized InfiniBand cluster
>>>>> (89 nodes). mpi_alltoall on 64 or more nodes takes an excessively
>>>>> long time: on 63 nodes it completes in a fraction of a second, while
>>>>> on 64 it takes about 20 seconds.
>>>>
>>>> Thanks for your report to the group. Could you please try running the
>>>> Alltoall program like this:
>>>>
>>>> $ mpirun_rsh -np 64 -hostfile mf DISABLE_RDMA_ALLTOALL=1 ./a.out
>>>>
>>>> If you could report the result of this back, it will help us in
>>>> narrowing down the problem.
>>>>
>>>> Thanks,
>>>> Sayantan.
>>>
>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


-- 
http://www.cse.ohio-state.edu/~surs


