[mvapich-discuss] MVAPICH2 2.x with PMI2 and SLURM (was: Re: problem with MVAPICH+Slurm)

Ryan Novosielski novosirj at rutgers.edu
Tue Apr 4 15:16:32 EDT 2017


Hi there,

This thread recently saved me after upgrading from SLURM 15.08.x (I don't recall the exact version, maybe 15.08.4), which I had built from the RPM spec provided in the SLURM tar file, to the OpenHPC-provided SLURM 16.05.5 build. I found that I got a segmentation fault without the MV2_ON_DEMAND_THRESHOLD=1 parameter (I personally added it to /etc/mvapich2.conf on all of my compute nodes).
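
For reference, a minimal sketch of what that looks like in the configuration file (that MVAPICH2 reads NAME=VALUE lines and '#' comments from /etc/mvapich2.conf is my reading of the 2.x user guide; the only line I actually added is the last one):

  # /etc/mvapich2.conf - system-wide MVAPICH2 runtime parameters, one NAME=VALUE per line
  # Workaround for the startup segfault discussed in the thread below
  MV2_ON_DEMAND_THRESHOLD=1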

From looking at that previous thread, this was offered as a "try this" workaround that was going to be investigated. Do we know why this happens, and whether there is any negative impact to setting MV2_ON_DEMAND_THRESHOLD=1? I want to make sure I'm not setting myself up for a different problem down the road, and that there isn't now a better solution.

Thanks!

> On Dec 13, 2016, at 5:26 AM, Manuel Rodríguez Pascual <manuel.rodriguez.pascual at gmail.com> wrote:
> 
> yes, it does work now :)
> 
> Thanks very much for your help, Sourav
> 
> 
> 
> 2016-12-12 22:22 GMT+01:00 Sourav Chakraborty <chakraborty.52 at buckeyemail.osu.edu>:
> Hi Manuel,
> 
> Thanks for reporting the issue. We are investigating it.
> 
> Can you please try setting "export MV2_ON_DEMAND_THRESHOLD=1" and see if it solves the issue?
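> 
> A minimal sketch of how that could look with your two-node test (reusing the script name from your message; this is just the suggestion above spelled out):
> 
>   export MV2_ON_DEMAND_THRESHOLD=1
>   srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh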
> 
> Thanks,
> Sourav
> 
> 
> On Mon, Dec 12, 2016 at 10:59 AM, Manuel Rodríguez Pascual <manuel.rodriguez.pascual at gmail.com> wrote:
> Hi all,
> 
> I am trying to configure MVAPICH2 and Slurm to work together, but I keep running into problems. I feel it must be something pretty obvious, yet I just cannot find the issue. My final objective is for MVAPICH2 to use the Slurm resource manager.
> 
> I am using MVAPICH2 2.2 and Slurm 15.08.12 (although the same problem arises with newer versions).
> 
> As a first test, I compiled with:
> 
> ./configure --prefix=/home/localsoft/mvapich2 --with-pm=mpirun:hydra --disable-mcast
> 
> This works OK, but it uses the Hydra process manager, which I don't want.
> 
> However, when I compile following the MVAPICH2 user guide, it does not work as expected:
> 
> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast --with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none --enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo --enable-mpit-pvars=all --enable-check-compiler-flags --enable-threads=multiple --enable-weak-symbols --enable-fast-install --enable-g=dbg --enable-error-messages=all --enable-error-checking=all
> 
> mpicc  helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
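> 
> To double-check that this build really picks up Slurm's PMI rather than Hydra, I believe something like the following works (mpiname ships with MVAPICH2; the exact PMI library name in the ldd output is a guess):
> 
>   /home/localsoft/mvapich2/bin/mpiname -a    # prints the version and the configure options used for this build
>   ldd helloWorldMPI | grep -i pmi            # the PMI library should resolve under /home/localsoft/slurm/lib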
> 
> This works fine when I run jobs on a single node, for example:
> -bash-4.2$ srun -n 2 --tasks-per-node=2 helloWorldMPISrun.sh
> Process 0 of 2 is on acme11.ciemat.es
> Process 1 of 2 is on acme11.ciemat.es
> Hello world from process 0 of 2
> Hello world from process 1 of 2
> 
> 
> But when I try to run it on two nodes, it crashes. I have included "export MV2_DEBUG_SHOW_BACKTRACE=1" in my script to get debugging info.
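> 
> The wrapper script itself is just that environment setting plus the binary, roughly like this (a sketch; the real script has a few more lines, since the shell error below points at line 5):
> 
>   #!/bin/bash
>   # helloWorldMPISrun.sh - launched by srun on each task
>   export MV2_DEBUG_SHOW_BACKTRACE=1
>   /home/slurm/tests/helloWorldMPI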
> 
> 
> -bash-4.2$ srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh
> [acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
> [acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   0: /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7fbff14d529c]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   1: /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59) [0x7fbff14d5399]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   2: /usr/lib64/libc.so.6(+0x35670) [0x7fbff0bd2670]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   3: /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281) [0x7fbff14cda91]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   4: /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c) [0x7fbff14ce0ac]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   5: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f) [0x7fbff14a612f]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   6: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842) [0x7fbff148e3f2]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   7: /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7fbff1483b82]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   8: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9) [0x7fbff1403689]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]   9: /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7fbff14030e6]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  10: /home/slurm/tests/helloWorldMPI() [0x400881]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  11: /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbff0bbeb15]
> [acme11.ciemat.es:mpi_rank_0][print_backtrace]  12: /home/slurm/tests/helloWorldMPI() [0x400789]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   0: /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7f6c0f62d29c]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   1: /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59) [0x7f6c0f62d399]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   2: /usr/lib64/libc.so.6(+0x35670) [0x7f6c0ed2a670]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   3: /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281) [0x7f6c0f625a91]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   4: /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c) [0x7f6c0f6260ac]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   5: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f) [0x7f6c0f5fe12f]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   6: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842) [0x7f6c0f5e63f2]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   7: /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7f6c0f5dbb82]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   8: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9) [0x7f6c0f55b689]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]   9: /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7f6c0f55b0e6]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  10: /home/slurm/tests/helloWorldMPI() [0x400881]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  11: /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6c0ed16b15]
> [acme12.ciemat.es:mpi_rank_1][print_backtrace]  12: /home/slurm/tests/helloWorldMPI() [0x400789]
> helloWorldMPISrun.sh: line 5: 26539 Segmentation fault  (core dumped) /home/slurm/tests/helloWorldMPI
> helloWorldMPISrun.sh: line 5: 27743 Segmentation fault  (core dumped) /home/slurm/tests/helloWorldMPI
> srun: error: acme12: task 1: Exited with exit code 139
> srun: error: acme11: task 0: Exited with exit code 139
> 
> 
> In case that it helps, config.log is attached.
> 
> Any clue as to what's going on? If this turns out to be a version issue, I have no problem downgrading my system to particular Slurm and MVAPICH2 versions that are known to work together. However, I suspect the problem is something else, as I am quite new to this and am probably doing something obviously wrong.
> 
> Thanks for your help,
> 
> 
> Manuel
> 
> 
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novosirj at rutgers.edu - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
     `'


