[mvapich-discuss] problem with MVAPICH+Slurm

Manuel Rodríguez Pascual manuel.rodriguez.pascual at gmail.com
Tue Dec 13 05:26:52 EST 2016


Yes, it does work now :)

Thanks very much for your help, Sourav



2016-12-12 22:22 GMT+01:00 Sourav Chakraborty <
chakraborty.52 at buckeyemail.osu.edu>:

> Hi Manuel,
>
> Thanks for reporting the issue. We are investigating it.
>
> Can you please try setting "export MV2_ON_DEMAND_THRESHOLD=1" and see if
> it solves the issue?
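>
> For example, a rough sketch based on the srun line from your report (the
> export could equally go inside your wrapper script):
>
> # establish connections on demand instead of all at once during MPI_Init
> export MV2_ON_DEMAND_THRESHOLD=1
> srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh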
>
> Thanks,
> Sourav
>
>
> On Mon, Dec 12, 2016 at 10:59 AM, Manuel Rodríguez Pascual <
> manuel.rodriguez.pascual at gmail.com> wrote:
>
>> Hi all,
>>
>> I am trying to configure MVAPICH and Slurm to work together, but I keep
>> running into problems. I feel it must be something pretty obvious, yet I
>> just cannot find the issue. My final objective is for MVAPICH to use the
>> Slurm resource manager.
>>
>> I am using mvapich2-2.2 and Slurm 15.08.12 (although the same problem
>> arises with newer versions).
>>
>> As a first test, I compiled with:
>>
>> ./configure --prefix=/home/localsoft/mvapich2
>> --with-pm=mpirun:hydra --disable-mcast
>>
>> This works OK, but uses Hydra (which I don't want).
>>
>> However, when compiling as described in the MVAPICH user guide, it does
>> not work as expected:
>>
>> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
>> --with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none
>> --enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
>> --enable-mpit-pvars=all --enable-check-compiler-flags
>> --enable-threads=multiple --enable-weak-symbols  --enable-fast-install
>> --enable-g=dbg --enable-error-messages=all --enable-error-checking=all
>>
>> mpicc  helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
>>
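>> A quick way to sanity-check that the build picked up Slurm's PMI rather
>> than Hydra's (a sketch; the exact PMI library name may vary on your
>> system):
>>
>> # the installed mpiname utility reports the build configuration
>> /home/localsoft/mvapich2/bin/mpiname -a | grep -i slurm
>> # the binary (via libmpi.so) should resolve Slurm's PMI library
>> ldd helloWorldMPI | grep -i pmi
>>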
>> This works fine when I run jobs on a single node, e.g.:
>> -bash-4.2$ srun -n 2 --tasks-per-node=2 helloWorldMPISrun.sh
>> Process 0 of 2 is on acme11.ciemat.es
>> Process 1 of 2 is on acme11.ciemat.es
>> Hello world from process 0 of 2
>> Hello world from process 1 of 2
>>
>>
>> But when I try to run it on two nodes, it crashes. I have included
>> "export MV2_DEBUG_SHOW_BACKTRACE=1" in my script for debugging info.
>>
>>
>> -bash-4.2$ srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh
>> [acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
>> [acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
>>
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   0: /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7fbff14d529c]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   1: /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59) [0x7fbff14d5399]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   2: /usr/lib64/libc.so.6(+0x35670) [0x7fbff0bd2670]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   3: /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281) [0x7fbff14cda91]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   4: /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c) [0x7fbff14ce0ac]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   5: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f) [0x7fbff14a612f]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   6: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842) [0x7fbff148e3f2]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   7: /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7fbff1483b82]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   8: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9) [0x7fbff1403689]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   9: /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7fbff14030e6]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  10: /home/slurm/tests/helloWorldMPI() [0x400881]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  11: /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbff0bbeb15]
>> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  12: /home/slurm/tests/helloWorldMPI() [0x400789]
>>
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   0: /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7f6c0f62d29c]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   1: /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59) [0x7f6c0f62d399]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   2: /usr/lib64/libc.so.6(+0x35670) [0x7f6c0ed2a670]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   3: /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281) [0x7f6c0f625a91]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   4: /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c) [0x7f6c0f6260ac]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   5: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f) [0x7f6c0f5fe12f]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   6: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842) [0x7f6c0f5e63f2]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   7: /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7f6c0f5dbb82]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   8: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9) [0x7f6c0f55b689]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   9: /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7f6c0f55b0e6]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  10: /home/slurm/tests/helloWorldMPI() [0x400881]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  11: /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6c0ed16b15]
>> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  12: /home/slurm/tests/helloWorldMPI() [0x400789]
>> helloWorldMPISrun.sh: line 5: 26539 Segmentation fault  (core dumped)
>> /home/slurm/tests/helloWorldMPI
>> helloWorldMPISrun.sh: line 5: 27743 Segmentation fault  (core dumped)
>> /home/slurm/tests/helloWorldMPI
>> srun: error: acme12: task 1: Exited with exit code 139
>> srun: error: acme11: task 0: Exited with exit code 139
>>
>>
>> In case it helps, config.log is attached.
>>
>> Any clue about what's going on? Also, if this is a version issue, I have
>> no problem downgrading my system to particular Slurm and MVAPICH versions
>> known to work together. However, I suspect the problem is something else,
>> as I am quite new to this and am probably doing something obvious wrong.
>>
>> Thanks for your help,
>>
>>
>> Manuel
>>
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>

