[mvapich-discuss] problem with MVAPICH+Slurm

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Mon Dec 12 16:22:14 EST 2016


Hi Manuel,

Thanks for reporting the issue. We are investigating it.

Can you please try setting "export MV2_ON_DEMAND_THRESHOLD=1" and see if it
solves the issue?
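
For example, a minimal sketch of applying that workaround from the submission
shell (assuming the exported environment is propagated by srun to the MPI
ranks, which is srun's default behavior):

    # Hedged sketch: export the MVAPICH2 tunable before launching, so that
    # srun passes it along to every rank.
    export MV2_ON_DEMAND_THRESHOLD=1
    srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh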

Thanks,
Sourav


On Mon, Dec 12, 2016 at 10:59 AM, Manuel Rodríguez Pascual <
manuel.rodriguez.pascual at gmail.com> wrote:

> Hi all,
>
> I am trying to configure MVAPICH and Slurm to work together, but I keep
> running into problems. I feel that it must be something pretty obvious,
> yet I just cannot find the issue. My final objective is for MVAPICH to
> use the Slurm resource manager.
>
> I am using mvapich2-2.2 and slurm 15.08.12 (although the same problem
> arises with newer versions).
>
> As a first test, I compiled with:
>
> ./configure --prefix=/home/localsoft/mvapich2 --with-pm=mpirun:hydra
> --disable-mcast
>
> This works OK, but uses Hydra (which I don't want).
>
> However, when I compile following the MVAPICH manual, it does not work as
> expected:
>
> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
> --with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none
> --enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
> --enable-mpit-pvars=all --enable-check-compiler-flags
> --enable-threads=multiple --enable-weak-symbols  --enable-fast-install
> --enable-g=dbg --enable-error-messages=all --enable-error-checking=all
>
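> (As a sanity check, and assuming the mpiname utility that MVAPICH2 installs
> under the same prefix, the flags the library was actually built with can be
> listed and compared against the configure line above:
>
> /home/localsoft/mvapich2/bin/mpiname -a
>
> It should show --with-pmi=slurm and --with-pm=none among the reported
> configuration options if this is the build being picked up.)
>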
> mpicc  helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
>
> This works fine when I run jobs on a single node, i.e.:
> -bash-4.2$ srun -n 2 --tasks-per-node=2 helloWorldMPISrun.sh
> Process 0 of 2 is on acme11.ciemat.es
> Process 1 of 2 is on acme11.ciemat.es
> Hello world from process 0 of 2
> Hello world from process 1 of 2
>
>
> But when I try to run it on two nodes, it crashes. I have included "export
> MV2_DEBUG_SHOW_BACKTRACE=1" in my script for debugging info.
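>
> (helloWorldMPISrun.sh is a thin wrapper around the binary; its exact
> contents were not shown, but a minimal sketch, assuming it only exports
> that debug variable and runs the program, would be:
>
> #!/bin/bash
> # Hypothetical wrapper executed by srun on each task; the real script was
> # not included in this message.
> export MV2_DEBUG_SHOW_BACKTRACE=1   # print a backtrace on fatal signals
> /home/slurm/tests/helloWorldMPI
> )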
>
>
> -bash-4.2$ srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh
> [acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error:
> Segmentation fault (signal 11)
> [acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error:
> Segmentation fault (signal 11)
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   0:
> /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c)
> [0x7fbff14d529c]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   1:
> /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59)
> [0x7fbff14d5399]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   2:
> /usr/lib64/libc.so.6(+0x35670) [0x7fbff0bd2670]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   3:
> /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281)
> [0x7fbff14cda91]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   4:
> /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c)
> [0x7fbff14ce0ac]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   5:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f)
> [0x7fbff14a612f]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   6:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842)
> [0x7fbff148e3f2]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   7:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2)
> [0x7fbff1483b82]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   8:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9)
> [0x7fbff1403689]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   9:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7fbff14030e6]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  10: /home/slurm/tests/helloWorldMPI()
> [0x400881]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  11:
> /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbff0bbeb15]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  12: /home/slurm/tests/helloWorldMPI()
> [0x400789]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   0:
> /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c)
> [0x7f6c0f62d29c]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   1:
> /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59)
> [0x7f6c0f62d399]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   2:
> /usr/lib64/libc.so.6(+0x35670) [0x7f6c0ed2a670]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   3:
> /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281)
> [0x7f6c0f625a91]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   4:
> /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c)
> [0x7f6c0f6260ac]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   5:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f)
> [0x7f6c0f5fe12f]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   6:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842)
> [0x7f6c0f5e63f2]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   7:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2)
> [0x7f6c0f5dbb82]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   8:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9)
> [0x7f6c0f55b689]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   9:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7f6c0f55b0e6]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  10: /home/slurm/tests/helloWorldMPI()
> [0x400881]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  11:
> /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6c0ed16b15]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  12: /home/slurm/tests/helloWorldMPI()
> [0x400789]
> helloWorldMPISrun.sh: línea 5: 26539 Violación de segmento  (`core'
> generado) /home/slurm/tests/helloWorldMPI
> helloWorldMPISrun.sh: línea 5: 27743 Violación de segmento  (`core'
> generado) /home/slurm/tests/helloWorldMPI
> (the Spanish-locale messages above mean "Segmentation fault (core dumped)")
> srun: error: acme12: task 1: Exited with exit code 139
> srun: error: acme11: task 0: Exited with exit code 139
>
>
> In case it helps, config.log is attached.
>
> Any clue as to what's going on? Also, if this is a version issue, I have
> no problem downgrading my system to particular Slurm and MVAPICH versions
> known to work together. However, I suspect the problem lies elsewhere, as
> I am quite new to this and am probably doing something obviously wrong.
>
> Thanks for your help,
>
>
> Manuel
>
>
>

