[mvapich-discuss] MVAPICH2 1.9 crash on nodes with different IB speeds
Martin Cuma
martin.cuma at utah.edu
Thu Oct 31 14:40:15 EDT 2013
Hello,
we think we have found a problem with MVAPICH2 running across nodes with
different IB speeds.
We had accidentally put a QDR cable on a node in a cluster with an FDR
fabric, which let us reproduce this error easily. However, in the near
future we also need to run jobs spanning 2 clusters, one with QDR and one
with FDR, and we hit similar problems in those tests as well. Since
MVAPICH2 is supposed to support mixed QDR and FDR fabrics, I am sending
this report.
Our test setup here is 4 nodes with dual 8-core SandyBridge CPUs (16
cores per node), 3 nodes with FDR and one with QDR.
When running an MVAPICH2 job on all of the cores (64 total), with the QDR
node last in the machinefile (cores 49-64), we get a crash, e.g.:
mpirun -np 64 -machinefile /tmp/mff
/uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/libexec/mvapich2/osu_reduce
Fatal error in MPI_Init: Internal MPI error!
Fatal error in MPI_Init: Internal MPI error!
Fatal error in MPI_Init: Internal MPI error!
Fatal error in MPI_Init: Internal MPI error!
This crash occurs whenever we use more than 61 cores. With 61 or fewer
cores (i.e., 13 or fewer cores on the QDR node), the program runs,
e.g.:
mpirun -np 61 -machinefile /tmp/mff
/uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/libexec/mvapich2/osu_reduce
# OSU MPI Reduce Latency Test
# Size Avg Latency(us)
4 10.46
....
The program also runs fine on the mixed FDR+QDR hardware if only 1 FDR
and 1 QDR node are used, for a total of 32 cores. So the behavior depends
on the number of FDR cores/nodes: fewer than 3 FDR nodes work, while 3
FDR nodes only work with 13 or fewer cores on the single QDR node.
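For reference, the way we cap the QDR node at 13 ranks is the per-host
process count in the Hydra machinefile (hostnames below are placeholders,
not our actual node names; our real machinefile is /tmp/mff):

```
# machinefile: 3 FDR nodes use all 16 cores, the QDR node is capped at 13
fdr-node1:16
fdr-node2:16
fdr-node3:16
qdr-node1:13
```

With that file, "mpirun -np 61 -machinefile /tmp/mff ..." runs, and "-np 62"
or more crashes as described above.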
I would be grateful if you could comment on this behavior, and possibly
test this on your hardware to see if you can reproduce it. We'd like to
get this fixed if possible so that we can use MVAPICH2 on our combined
cluster with QDR and FDR.
Just FYI, we also tried OpenMPI and that seems to be working fine.
Thanks,
MC
--
Martin Cuma
Center for High Performance Computing
Department of Geology and Geophysics
University of Utah