[mvapich-discuss] problem with MVAPICH+Slurm

Manuel Rodríguez Pascual manuel.rodriguez.pascual at gmail.com
Mon Dec 12 10:59:40 EST 2016


Hi all,

I am trying to configure MVAPICH and Slurm to work together, but I keep
running into problems. I feel it must be something pretty obvious, yet I
just cannot find the issue. My final objective is for MVAPICH to use the
Slurm resource manager.

I am using mvapich2-2.2 and Slurm 15.08.12 (although the same problem
arises with newer versions).

As a first test, I compiled with:

./configure --prefix=/home/localsoft/mvapich2
--with-pm=mpirun:hydra --disable-mcast

This works OK, but it uses hydra (which I do not want to use).
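With that build I can launch jobs directly with the bundled launchers,
along these lines (just an illustration; acme11/acme12 are my test
nodes):

# hydra launcher
/home/localsoft/mvapich2/bin/mpiexec -n 2 -hosts acme11,acme12 ./helloWorldMPI

# or the mpirun_rsh launcher
/home/localsoft/mvapich2/bin/mpirun_rsh -np 2 acme11 acme12 ./helloWorldMPI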

However, when I compile following the MVAPICH user guide, it does not
work as expected:

./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
--with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none
--enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
--enable-mpit-pvars=all --enable-check-compiler-flags
--enable-threads=multiple --enable-weak-symbols  --enable-fast-install
--enable-g=dbg --enable-error-messages=all --enable-error-checking=all
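
After configuring I build and install in the usual way and point my
environment at the new installation (standard steps, listed here only
for completeness):

make -j 8 && make install

# make sure the Slurm-enabled build is the one that gets picked up
export PATH=/home/localsoft/mvapich2/bin:$PATH
export LD_LIBRARY_PATH=/home/localsoft/mvapich2/lib:$LD_LIBRARY_PATH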

I then compile the test program with:

mpicc helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib

This works fine when I run jobs on a single node, i.e.:
-bash-4.2$ srun -n 2 --tasks-per-node=2 helloWorldMPISrun.sh
Process 0 of 2 is on acme11.ciemat.es
Process 1 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Hello world from process 1 of 2


But when I try to run it on two nodes, it crashes. I have included
"export MV2_DEBUG_SHOW_BACKTRACE=1" in my script to get debugging info.
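
For reference, helloWorldMPISrun.sh is essentially just a thin wrapper
around the binary, roughly like this:

#!/bin/bash
# helloWorldMPISrun.sh -- wrapper that srun starts for every task
# (sketch; the real script may differ slightly)
export MV2_DEBUG_SHOW_BACKTRACE=1
/home/slurm/tests/helloWorldMPI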


-bash-4.2$ srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh
[acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error: Segmentation
fault (signal 11)
[acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error: Segmentation
fault (signal 11)
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   0: /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7fbff14d529c]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   1: /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59) [0x7fbff14d5399]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   2: /usr/lib64/libc.so.6(+0x35670) [0x7fbff0bd2670]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   3: /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281) [0x7fbff14cda91]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   4: /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c) [0x7fbff14ce0ac]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   5: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f) [0x7fbff14a612f]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   6: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842) [0x7fbff148e3f2]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   7: /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7fbff1483b82]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   8: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9) [0x7fbff1403689]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]   9: /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7fbff14030e6]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]  10: /home/slurm/tests/helloWorldMPI() [0x400881]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]  11: /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbff0bbeb15]
[acme11.ciemat.es:mpi_rank_0][print_backtrace]  12: /home/slurm/tests/helloWorldMPI() [0x400789]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   0: /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7f6c0f62d29c]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   1: /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59) [0x7f6c0f62d399]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   2: /usr/lib64/libc.so.6(+0x35670) [0x7f6c0ed2a670]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   3: /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281) [0x7f6c0f625a91]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   4: /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c) [0x7f6c0f6260ac]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   5: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f) [0x7f6c0f5fe12f]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   6: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842) [0x7f6c0f5e63f2]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   7: /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7f6c0f5dbb82]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   8: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9) [0x7f6c0f55b689]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]   9: /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7f6c0f55b0e6]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]  10: /home/slurm/tests/helloWorldMPI() [0x400881]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]  11: /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6c0ed16b15]
[acme12.ciemat.es:mpi_rank_1][print_backtrace]  12: /home/slurm/tests/helloWorldMPI() [0x400789]
helloWorldMPISrun.sh: line 5: 26539 Segmentation fault  (core dumped) /home/slurm/tests/helloWorldMPI
helloWorldMPISrun.sh: line 5: 27743 Segmentation fault  (core dumped) /home/slurm/tests/helloWorldMPI
srun: error: acme12: task 1: Exited with exit code 139
srun: error: acme11: task 0: Exited with exit code 139


In case it helps, config.log is attached.

Any clue about what is going on? If this turns out to be a version
issue, I have no problem downgrading my system to particular Slurm and
MVAPICH versions that are known to work together. However, I suspect the
problem is something else, as I am quite new to this and am probably
doing something obviously wrong.

Thanks for your help,


Manuel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.log
Type: text/x-log
Size: 744156 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20161212/e6fc092d/attachment-0001.bin>

