[mvapich-discuss] MVAPICH2 1.9 crash on nodes with different IB speeds

Martin Cuma martin.cuma at utah.edu
Thu Oct 31 14:40:15 EDT 2013


Hello,

We think we have found a problem with MVAPICH2 when running across nodes with
different IB speeds.

We accidentally put a QDR cable on a node in a cluster with an FDR fabric,
which lets us reproduce this error easily. However, in the near future we need
to run across two clusters, one with QDR and one with FDR, and our tests in
that setup have shown similar problems. Since MVAPICH2 is supposed to support
mixed QDR and FDR, I am sending this report.

Our test setup here is 4 nodes with dual 8-core SandyBridge CPUs (16 cores
per node), 3 nodes with FDR and one with QDR.
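
For reference, the machinefile just lists the nodes with 16 slots each, the
QDR node last; something along these lines (the hostnames below are
placeholders, not our actual node names):

    # /tmp/mff - 3 FDR nodes followed by the 1 QDR node
    fdr-node1:16
    fdr-node2:16
    fdr-node3:16
    qdr-node1:16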

When running an MVAPICH2 job on all of the cores (64 total), with the QDR node
last (cores 49-64), we get a crash, e.g.:
mpirun -np 64 -machinefile /tmp/mff 
/uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/libexec/mvapich2/osu_reduce
Fatal error in MPI_Init: Internal MPI error!
Fatal error in MPI_Init: Internal MPI error!
Fatal error in MPI_Init: Internal MPI error!
Fatal error in MPI_Init: Internal MPI error!
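
Since the failure is in MPI_Init itself, it does not seem specific to the OSU
benchmarks; a trivial program along these lines (just a sketch of what we are
effectively running) should presumably be enough to trigger it on more than 61
ranks:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        /* the reported failure happens here, before any communication */
        MPI_Init(&argc, &argv);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0)
            printf("MPI_Init succeeded on %d ranks\n", size);

        MPI_Finalize();
        return 0;
    }

(compiled with mpicc and launched with the same mpirun line as the osu_reduce
run above)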

This crash occurs whenever we use more than 61 cores. If we use 61 cores or
fewer (i.e. 13 or fewer cores on the QDR node), the program runs, e.g.:
mpirun -np 61 -machinefile /tmp/mff 
/uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/libexec/mvapich2/osu_reduce
# OSU MPI Reduce Latency Test
# Size         Avg Latency(us)
4                        10.46
....

The program also runs fine on the mixed FDR+QDR hardware if only 1 FDR and
1 QDR node are used, for a total of 32 cores. So the behavior depends on the
number of FDR nodes/cores: fewer than 3 FDR nodes work, while 3 FDR nodes only
work with 13 or fewer cores on the single QDR node.

I would be grateful if you could comment on this behavior, and possibly 
test this on your hardware to see if you can reproduce it. We'd like to 
get this fixed if possible so that we can use MVAPICH2 on our combined 
cluster with QDR and FDR.

Just FYI, we also tried OpenMPI and that seems to be working fine.

Thanks,
MC

-- 
Martin Cuma
Center for High Performance Computing
Department of Geology and Geophysics
University of Utah


