[mvapich-discuss] MVAPICH2 1.9 crash on nodes with different IB speeds

Hari Subramoni subramoni.1 at osu.edu
Tue Nov 5 15:11:52 EST 2013


Hi Martin,

This is a little strange. MVAPICH2 has heterogeneity support, and this
should work fine out of the box. Could you please give us some
information about the configure options you used to build the MVAPICH2
library? You can run mpiname -a to obtain this information, assuming the
MVAPICH2 install is in your path; if not, the binary is available in the
'bin' directory of the MVAPICH2 install. From your e-mail, that should be
/uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/bin/mpiname.

Do you happen to have a debug build of MVAPICH2 handy (one configured
with --enable-g=dbg --disable-fast)? If so, could you please re-run with
MV2_DEBUG_SHOW_BACKTRACE=1 and MV2_SHOW_ENV_INFO=1 set and send us the
output?

Thanks,
Hari.


On Thu, Oct 31, 2013 at 2:40 PM, Martin Cuma <martin.cuma at utah.edu> wrote:

> Hello,
>
> we think we have found a problem with MVAPICH2 running across nodes with
> different IB speeds.
>
> We had accidentally put a QDR cable on a node in a cluster with an FDR
> fabric, which let us reproduce this error easily. However, in the near
> future we also need to run jobs across two clusters, one QDR and one FDR,
> and we saw similar problems in tests there as well. Since MVAPICH2 is
> supposed to support mixed QDR and FDR, I am sending this report.
>
> Our test setup here is 4 nodes with dual 8-core SandyBridge CPUs (16
> cores per node), 3 with FDR and one with QDR.
>
> When running an MVAPICH2 job on all of the cores (64 total), with the QDR
> node last (cores 49-64), we get a crash, e.g.:
> mpirun -np 64 -machinefile /tmp/mff /uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/libexec/mvapich2/osu_reduce
> Fatal error in MPI_Init: Internal MPI error!
> Fatal error in MPI_Init: Internal MPI error!
> Fatal error in MPI_Init: Internal MPI error!
> Fatal error in MPI_Init: Internal MPI error!
>
> This crash occurs whenever we use more than 61 cores. With 61 cores or
> fewer (i.e., 13 or fewer cores from the QDR node), the program runs, e.g.:
> mpirun -np 61 -machinefile /tmp/mff /uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/libexec/mvapich2/osu_reduce
> # OSU MPI Reduce Latency Test
> # Size         Avg Latency(us)
> 4                        10.46
> ....
>
> The program also runs fine on the FDR+QDR hardware if only 1 FDR and 1 QDR
> node are used, for a total of 32 cores. So, the behavior depends on the
> number of FDR cores/nodes: fewer than 3 FDR nodes work, while 3 FDR nodes
> only work with 13 or fewer cores on the single QDR node.
>
> I would be grateful if you could comment on this behavior, and possibly
> test it on your hardware to see if you can reproduce it. We'd like to get
> this fixed if possible so that we can use MVAPICH2 on our combined QDR
> and FDR cluster.
>
> Just FYI, we also tried OpenMPI and that seems to be working fine.
>
> Thanks,
> MC
>
> --
> Martin Cuma
> Center for High Performance Computing
> Department of Geology and Geophysics
> University of Utah
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>

