[mvapich-discuss] problem with MVAPICH+Slurm
Manuel Rodríguez Pascual
manuel.rodriguez.pascual at gmail.com
Mon Dec 12 10:59:40 EST 2016
Hi all,
I am trying to configure MVAPICH and Slurm to work together, but I keep
running into problems. I feel it must be something pretty obvious, but I
just cannot find the issue. My final objective is for MVAPICH to use the
Slurm resource manager.
I am using mvapich2-2.2 and slurm 15.08.12 (although the same problem
arises with newer versions)
As a first test, compiling with:
./configure --prefix=/home/localsoft/mvapich2
--with-pm=mpirun:hydra --disable-mcast
This works OK, but it uses Hydra (which I don't want).
However, when compiling following the MVAPICH manual, it does not work as
expected:
./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
--with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none
--enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
--enable-mpit-pvars=all --enable-check-compiler-flags
--enable-threads=multiple --enable-weak-symbols --enable-fast-install
--enable-g=dbg --enable-error-messages=all --enable-error-checking=all
mpicc helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
This works fine when I am running jobs on a single node, i.e.:
-bash-4.2$ srun -n 2 --tasks-per-node=2 helloWorldMPISrun.sh
Process 0 of 2 is on acme11.ciemat.es
Process 1 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Hello world from process 1 of 2
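For reference, a hello world that prints output in that shape would look roughly like this (a sketch only; the actual helloWorldMPI.c is not included in this message, so the details are guessed from the output format above):

```c
/* Hypothetical reconstruction of helloWorldMPI.c -- the real source is not
 * shown in this post; this merely matches the output format seen above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);              /* the backtraces below point here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    printf("Process %d of %d is on %s\n", rank, size, name);
    printf("Hello world from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Note that every frame of the crash backtraces sits inside MPI_Init, i.e. the failure happens during startup, before any user-level communication.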
But when I try to run it on two nodes, it crashes. I have included "export
MV2_DEBUG_SHOW_BACKTRACE=1" in my script to get debugging info:
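For context, helloWorldMPISrun.sh is presumably a small wrapper along these lines (a guess; the actual script is not shown in the post, and the "línea 5" in the Spanish error messages further down refers to a line of this script):

```shell
#!/bin/bash
# helloWorldMPISrun.sh -- wrapper executed by srun for every task.
# Sketch only: reconstructed from the post, not the real script.
export MV2_DEBUG_SHOW_BACKTRACE=1   # make MVAPICH2 print a backtrace on fatal signals
/home/slurm/tests/helloWorldMPI
```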
-bash-4.2$ srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh
[acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error: Segmentation
fault (signal 11)
[acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error: Segmentation
fault (signal 11)
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 0:
/home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c)
[0x7fbff14d529c]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 1:
/home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59)
[0x7fbff14d5399]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 2:
/usr/lib64/libc.so.6(+0x35670) [0x7fbff0bd2670]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 3:
/home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281)
[0x7fbff14cda91]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 4:
/home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c)
[0x7fbff14ce0ac]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 5:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f)
[0x7fbff14a612f]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 6:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842)
[0x7fbff148e3f2]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 7:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7fbff1483b82]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 8:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9)
[0x7fbff1403689]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 9:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7fbff14030e6]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 10:
/home/slurm/tests/helloWorldMPI() [0x400881]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 11:
/usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbff0bbeb15]
[acme11.ciemat.es:mpi_rank_0][print_backtrace] 12:
/home/slurm/tests/helloWorldMPI() [0x400789]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 0:
/home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c)
[0x7f6c0f62d29c]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 1:
/home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59)
[0x7f6c0f62d399]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 2:
/usr/lib64/libc.so.6(+0x35670) [0x7f6c0ed2a670]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 3:
/home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281)
[0x7f6c0f625a91]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 4:
/home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c)
[0x7f6c0f6260ac]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 5:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f)
[0x7f6c0f5fe12f]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 6:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842)
[0x7f6c0f5e63f2]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 7:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7f6c0f5dbb82]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 8:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9)
[0x7f6c0f55b689]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 9:
/home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7f6c0f55b0e6]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 10:
/home/slurm/tests/helloWorldMPI() [0x400881]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 11:
/usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6c0ed16b15]
[acme12.ciemat.es:mpi_rank_1][print_backtrace] 12:
/home/slurm/tests/helloWorldMPI() [0x400789]
helloWorldMPISrun.sh: línea 5: 26539 Violación de segmento (`core'
generado) /home/slurm/tests/helloWorldMPI
helloWorldMPISrun.sh: línea 5: 27743 Violación de segmento (`core'
generado) /home/slurm/tests/helloWorldMPI
(The Spanish messages mean "line 5: Segmentation fault (core dumped)".)
srun: error: acme12: task 1: Exited with exit code 139
srun: error: acme11: task 0: Exited with exit code 139
In case it helps, config.log is attached.
Any clue on what's going on? If this is a version issue, I have no problem
downgrading my system to particular Slurm and MVAPICH versions known to work
together. However, I suspect the problem is something else, as I am quite a
newbie at this and am probably doing something obvious wrong.
Thanks for your help,
Manuel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.log
Type: text/x-log
Size: 744156 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20161212/e6fc092d/attachment-0001.bin>