[mvapich-discuss] MPI code failure - help to diagnose

Vladimir Florinski vaf0001 at uah.edu
Wed Oct 19 16:12:45 EDT 2016


Here is the output of mpiname:

MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail

Compilation
CC: gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64
-mtune=generic   -DNDEBUG -DNVALGRIND -g -O2
CXX: g++ -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64
-mtune=generic  -DNDEBUG -DNVALGRIND -g -O2
F77: gfortran -O2 -g -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
--param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
-I/usr/lib64/gfortran/modules  -g -O2
FC: gfortran -O2 -g -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
--param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
-I/usr/lib64/gfortran/modules  -g -O2

Configuration
--build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu
--program-prefix= --disable-dependency-tracking --prefix=/usr
--exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
--datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64
--libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib
--mandir=/usr/share/man --infodir=/usr/share/info --with-device=ch3:mrail
--with-rdma=gen2 --with-pmi=pmi2 --with-pm=slurm --enable-g=dbg
--enable-debuginfo --enable-cuda --with-cuda=/usr/local/cuda
build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu
CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64
-mtune=generic LDFLAGS=-Wl,-z,relro
-specs=/usr/lib/rpm/redhat/redhat-hardened-ld CXXFLAGS=-O2 -g -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
FCFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64
-mtune=generic -I/usr/lib64/gfortran/modules FFLAGS=-O2 -g -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
-I/usr/lib64/gfortran/modules --no-create --no-recursion


As we can see, Slurm support is enabled (--with-pm=slurm --with-pmi=pmi2).
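For reference, the test code consists of nothing but MPI_Init() and
MPI_Finalize(), as mentioned in my earlier message below; a minimal
equivalent of what I am running:

#include <mpi.h>

int main(int argc, char *argv[])
{
    /* The job aborts inside MPI_Init when it spans more than one node. */
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}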


Re-running the code with the suggested debug option (export
MV2_DEBUG_SHOW_BACKTRACE=2) produced the following output for a 24-process
run:

Every failing rank printed an identical backtrace (only the load addresses
differ). A representative trace, from rank 18 on node63:

[node63:mpi_rank_18][print_backtrace]   0: /usr/lib64/libmpi.so.12(print_backtrace+0x31) [0x7fa85a56c351]
[node63:mpi_rank_18][print_backtrace]   1: /usr/lib64/libmpi.so.12(MPIDI_CH3_Abort+0xaf) [0x7fa85a50d18f]
[node63:mpi_rank_18][print_backtrace]   2: /usr/lib64/libmpi.so.12(MPID_Abort+0x17f) [0x7fa85a4faf0f]
[node63:mpi_rank_18][print_backtrace]   3: /usr/lib64/libmpi.so.12(+0x2cfa0e) [0x7fa85a4c5a0e]
[node63:mpi_rank_18][print_backtrace]   4: /usr/lib64/libmpi.so.12(MPIR_Err_return_comm+0x118) [0x7fa85a4c5b58]
[node63:mpi_rank_18][print_backtrace]   5: /usr/lib64/libmpi.so.12(MPI_Init+0x104) [0x7fa85a4822d4]
[node63:mpi_rank_18][print_backtrace]   6: /data/vladimir/Global3D/Geodesic5a_r1/./a.out() [0x400728]
[node63:mpi_rank_18][print_backtrace]   7: /usr/lib64/libc.so.6(__libc_start_main+0xf1) [0x7fa859c39731]
[node63:mpi_rank_18][print_backtrace]   8: /data/vladimir/Global3D/Geodesic5a_r1/./a.out() [0x400639]

The remaining ranks on node63 (12-17, 19-23) and ranks 0, 1, and 11 on
node62 printed the same trace, interleaved with the following scheduler
messages:

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 63.0 ON node62 CANCELLED AT 2016-10-19T15:02:47 ***
srun: error: node63: tasks 12-23: Exited with exit code 1
srun: error: node62: tasks 0-1: Exited with exit code 1
srun: error: node62: tasks 2-11: Killed
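
Note that the abort occurs inside MPI_Init itself (frame 5 of the trace),
before the program makes any other MPI calls.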


Best,



On Wed, Oct 19, 2016 at 2:48 PM, Hari Subramoni <subramoni.1 at osu.edu> wrote:

> Hi Vladimir,
>
> Did you configure MVAPICH2 with SLURM support? Please refer to the
> following section of the MVAPICH2 userguide for information on how to do
> this.
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2-userguide.html#x1-100004.3.2
>
> Can you please send the output of mpiname -a? This will tell us how
> MVAPICH2 was built.
>
> To diagnose runtime issues and to get a backtrace, you can rerun your
> program after adding "MV2_DEBUG_SHOW_BACKTRACE=2" to the environment
> (export MV2_DEBUG_SHOW_BACKTRACE=2).
>
> If you have compiled MVAPICH2 with debugging options, the above command
> will provide a detailed backtrace. If not, the backtrace may be limited.
> You need to add "--enable-g=gdb --enable-fast=none" to the MVAPICH2
> configure line to enable debugging support.
>
> Please refer to the following section of the MVAPICH2 userguide for
> information on how to do this.
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2-userguide.html#x1-1310009.1.14
>
> Regards,
> Hari.
>
> On Wed, Oct 19, 2016 at 2:10 PM, Vladimir Florinski <vaf0001 at uah.edu>
> wrote:
>
>> On deploying mvapich2 version 2.2 on our cluster we found that MPI codes
>> refuse to run across multiple nodes. The installation uses Slurm as the
>> process manager. Non-MPI codes run fine across any number of nodes, and
>> MPI codes run fine on a single node using any number of locally available
>> cores (16 in this case). However, MPI codes fail on more than one node.
>> For example,
>>
>> srun --mpi=pmi2 -n 16 ./a.out            runs OK
>> srun --mpi=pmi2 -n 17 ./a.out            fails
>>
>> (the code consists of MPI_Init() and MPI_Finalize() only). The message is
>> very generic, so it offers little help:
>>
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> slurmstepd: error: *** STEP 59.0 ON node62 CANCELLED AT
>> 2016-10-19T12:49:10 ***
>> srun: error: node63: tasks 9-16: Exited with exit code 1
>> srun: error: node62: tasks 0-8: Killed
>>
>> The facts seem to rule out a Slurm error and point to an issue with
>> InfiniBand. That part has been tested thoroughly, and all diagnostics
>> completed without errors. There is no firewall running. I am rather out of
>> ideas at this point and would welcome advice on troubleshooting the problem.
>>
>> Thanks,
>>
>> --
>> Vladimir Florinski
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>


-- 
Vladimir Florinski