[mvapich-discuss] Issue with mpi_alltoall on 64 nodes or more
Rick Warner
rick at microway.com
Tue Apr 25 14:10:32 EDT 2006
Hello all,
We are experiencing a problem on a medium-sized InfiniBand cluster (89
nodes). mpi_alltoall on 64 or more nodes takes an excessively long time. On
63 nodes it completes in a fraction of a second; on 64 it takes about 20
seconds.
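For reference, the timing we are seeing comes from a reproducer roughly like
the sketch below (this is an illustrative minimal benchmark, not our exact
test code; the message size and buffer contents are arbitrary assumptions):

```c
/* Minimal MPI_Alltoall timing sketch. Build with mpicc, run with
 * mpirun -np <N>. The count of 1024 ints per peer is illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;  /* ints sent to each peer rank */
    int *sendbuf = malloc((size_t)size * count * sizeof(int));
    int *recvbuf = malloc((size_t)size * count * sizeof(int));
    for (int i = 0; i < size * count; i++)
        sendbuf[i] = rank;

    /* Synchronize, then time a single collective. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Alltoall(sendbuf, count, MPI_INT,
                 recvbuf, count, MPI_INT, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("alltoall on %d ranks took %.6f s\n", size, t1 - t0);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Run on 63 ranks (one per node) this finishes almost instantly for us; on 64
ranks the MPI_Alltoall itself is where the ~20 seconds is spent.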
We're using the latest 1.8.2 IB Gold release from Mellanox for the drivers,
and the 0.97 stable svn trunk for MVAPICH.
The InfiniBand fabric is connected as follows: four 36-port leaf switches and
two 24-port spine switches, with 6 cables from each leaf to each spine (leaf
A has 6 cables to spine A and 6 cables to spine B, and so on).
We have tried using Open MPI instead, and it completes in a fraction of a
second on both 63 and 64 nodes. Also, with MVAPICH running 64 processes on 32
nodes (2 per node instead of 1), the alltoall completes quickly as well. Open
MPI does not perform as well as MVAPICH overall, but it has enabled us to
narrow down this problem. At this point, we believe the problem is either a
bug in MVAPICH itself, or a hardware problem triggered by a communication
pattern that MVAPICH uses.
Any ideas on this?
--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517