[mvapich-discuss] MVAPICH2 1.9 crash on nodes with different IB speeds
Martin Cuma
martin.cuma at utah.edu
Tue Nov 5 15:21:41 EST 2013
Hi Hari,
thanks for the reply. We'll need some time to set up the nodes for the run
with the debugging flags you specified. In the meantime, this is how we
build MVAPICH2:
setenv CC gcc
setenv CXX g++
setenv FC gfortran
setenv F77 gfortran
setenv FFLAGS " -fPIC"
setenv FCFLAGS " -fPIC"
setenv CFLAGS " -fPIC -D_GNU_SOURC -D_GNU_SOURCE"
setenv CXXFLAGS " -fPIC"
setenv LDFLAGS "-fPIC"
../../../src/mvapich2/1.9/configure
--prefix=/uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9 --enable-romio
--with-file-system=nfs+ufs --enable-sharedlibs=gcc
--with-ib-include=/usr/include --with-ib-libpath=/usr/lib64
--enable-shared --with-device=ch3:nemesis:ib --enable-cxx
and mpiname -a reports accordingly:
MVAPICH2 1.9 Mon May 6 12:25:08 EDT 2013 ch3:nemesis
Compilation
CC: gcc -fPIC -D_GNU_SOURC -D_GNU_SOURCE -DNDEBUG -DNVALGRIND -O2
CXX: g++ -fPIC -DNDEBUG -DNVALGRIND -O2
F77: gfortran -fPIC -O2
FC: gfortran -fPIC -O2
Configuration
--prefix=/uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9 --enable-romio
--with-file-system=nfs+ufs --enable-sharedlibs=gcc
--with-ib-include=/usr/include --with-ib-libpath=/usr/lib64
--enable-shared --with-device=ch3:nemesis:ib --enable-cxx
I'll get back to you when we get those debug logs.
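For reference, here is roughly how I plan to produce that debug build: the same configure line as above with the two flags you asked for added, then the failing 64-core case re-run with the debug variables set. This is only a sketch; the `1.9-dbg` install prefix is a name I made up to keep the builds separate, and I use `env` so the variables get set the same way regardless of shell.

```shell
# Reconfigure MVAPICH2 1.9 with debug support, as requested
# (--enable-g=dbg --disable-fast; the 1.9-dbg prefix is hypothetical)
../../../src/mvapich2/1.9/configure \
    --prefix=/uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9-dbg \
    --enable-romio --with-file-system=nfs+ufs --enable-sharedlibs=gcc \
    --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 \
    --enable-shared --with-device=ch3:nemesis:ib --enable-cxx \
    --enable-g=dbg --disable-fast
make && make install

# Re-run the failing 64-core case with the requested debug variables
env MV2_DEBUG_SHOW_BACKTRACE=1 MV2_SHOW_ENV_INFO=1 \
    mpirun -np 64 -machinefile /tmp/mff \
    /uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9-dbg/libexec/mvapich2/osu_reduce
```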
Thanks,
MC
On Tue, 5 Nov 2013, Hari Subramoni wrote:
> Hi Martin,
>
> This is a little strange. MVAPICH2 has heterogeneity support and this should be working fine out of the box. Could you
> please give us some information about the configure options you used to build the MVAPICH2 library? You can do mpiname -a
> to obtain this information (assuming you have the MVAPICH2 install in your path; if not, it is available in the 'bin'
> directory of the MVAPICH2 install. From your e-mail, that should be
> /uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/bin/mpiname).
>
> Do you have a debug build of MVAPICH2 available handy (MVAPICH2 configured with --enable-g=dbg --disable-fast)? If so,
> could you please re-run with MV2_DEBUG_SHOW_BACKTRACE=1 and MV2_SHOW_ENV_INFO=1 and give us the output?
>
> Thanks,
> Hari.
>
>
> On Thu, Oct 31, 2013 at 2:40 PM, Martin Cuma <martin.cuma at utah.edu> wrote:
> Hello,
>
> we think we have found a problem with MVAPICH2 running across nodes with different IB speeds.
>
> We had accidentally put a QDR cable on a node in a cluster with an FDR fabric, which let us reproduce this
> error easily. However, in the near future we also need to run across 2 clusters, one with QDR and one with
> FDR, and we hit similar problems in tests there as well. Since MVAPICH2 is supposed to support mixed QDR
> and FDR, I am sending this report.
>
> Our test setup here is 4 nodes with dual 8-core Sandy Bridge CPUs (16 cores per node), 3 with
> FDR and one with QDR.
>
> When running an MVAPICH2 job on all of the cores (64 total), with the QDR node being the last (cores 49-64), we get a
> crash, e.g.:
> mpirun -np 64 -machinefile /tmp/mff /uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/libexec/mvapich2/osu_reduce
> Fatal error in MPI_Init: Internal MPI error!
> Fatal error in MPI_Init: Internal MPI error!
> Fatal error in MPI_Init: Internal MPI error!
> Fatal error in MPI_Init: Internal MPI error!
>
> This crash occurs whenever we use more than 61 cores. With 61 cores or fewer (i.e. 13 or fewer cores from
> the QDR node), the program runs, e.g.:
> mpirun -np 61 -machinefile /tmp/mff /uufs/kingspeak.peaks/sys/pkg/mvapich2/1.9/libexec/mvapich2/osu_reduce
> # OSU MPI Reduce Latency Test
> # Size Avg Latency(us)
> 4 10.46
> ....
>
> The program also runs fine on the FDR+QDR hardware if only 1 FDR and 1 QDR node is used, for a total of 32
> cores. So the behavior depends on the number of FDR nodes: fewer than 3 FDR nodes work in any configuration,
> while 3 FDR nodes work only with 13 or fewer cores on the single QDR node.
>
> I would be grateful if you could comment on this behavior, and possibly test this on your hardware to see if
> you can reproduce it. We'd like to get this fixed if possible so that we can use MVAPICH2 on our combined
> cluster with QDR and FDR.
>
> Just FYI, we also tried OpenMPI and that seems to be working fine.
>
> Thanks,
> MC
>
> --
> Martin Cuma
> Center for High Performance Computing
> Department of Geology and Geophysics
> University of Utah
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
--
Martin Cuma
Center for High Performance Computing
Department of Geology and Geophysics
University of Utah