[mvapich-discuss] MVAPICH2 2.x with PMI2 and SLURM (was: Re: problem with MVAPICH+Slurm)

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Tue Apr 4 16:35:20 EDT 2017


Hello Ryan,

This issue has been fixed in the latest release, MVAPICH2 2.3a.

Setting MV2_ON_DEMAND_THRESHOLD=1 was suggested as a workaround. It enables
on-demand connection setup regardless of the number of processes and does not
affect performance. With the latest release the workaround is no longer needed.
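
For anyone still on an affected version, here is a minimal sketch of the two
ways the workaround has been applied in this thread (the srun line is just an
example; the /etc/mvapich2.conf format is assumed to be one VARIABLE=VALUE
pair per line):

    # Option 1: export the variable in the job script before launching
    export MV2_ON_DEMAND_THRESHOLD=1
    srun -n 2 --tasks-per-node=1 ./helloWorldMPI

    # Option 2: set it cluster-wide in /etc/mvapich2.conf on each compute node
    MV2_ON_DEMAND_THRESHOLD=1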

Thanks,
Sourav
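
P.S. Since the subject line mentions PMI2: below is a hedged sketch of
configuring MVAPICH2 2.x to use Slurm's PMI2 interface. The paths are
placeholders taken from the quoted messages, and the flag names should be
double-checked against "./configure --help" for your MVAPICH2 version and
against your Slurm build (which must provide the PMI2 library):

    ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast \
        --with-pm=slurm --with-pmi=pmi2 \
        --with-slurm=/home/localsoft/slurm

    # Launch through Slurm's PMI2 plugin (if your Slurm build includes it)
    srun --mpi=pmi2 -n 2 --tasks-per-node=1 ./helloWorldMPI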


On Tue, Apr 4, 2017 at 3:16 PM, Ryan Novosielski <novosirj at rutgers.edu>
wrote:

> Hi there,
>
> This thread recently saved me after upgrading from SLURM 15.08.x (I don’t
> recall the exact version, maybe 15.08.4), built from the RPM spec provided
> in the SLURM tarball, to the OpenHPC-provided SLURM 16.05.5 build. Without
> the parameter I got a segmentation fault (I added it to /etc/mvapich2.conf
> on all of my compute nodes).
>
> From looking at this previous thread, this was a “try this” workaround that
> was going to be investigated. Do we know why this is happening, and whether
> there is any negative impact to setting MV2_ON_DEMAND_THRESHOLD=1? I want to
> make sure I’m not setting myself up for a different problem down the road,
> and that there isn’t now a better solution.
>
> Thanks!
>
> > On Dec 13, 2016, at 5:26 AM, Manuel Rodríguez Pascual <
> manuel.rodriguez.pascual at gmail.com> wrote:
> >
> > yes, it does work now :)
> >
> > Thanks very much for your help, Sourav
> >
> >
> >
> > 2016-12-12 22:22 GMT+01:00 Sourav Chakraborty <
> chakraborty.52 at buckeyemail.osu.edu>:
> > Hi Manuel,
> >
> > Thanks for reporting the issue. We are investigating it.
> >
> > Can you please try setting "export MV2_ON_DEMAND_THRESHOLD=1" and see if
> it solves the issue?
> >
> > Thanks,
> > Sourav
> >
> >
> > On Mon, Dec 12, 2016 at 10:59 AM, Manuel Rodríguez Pascual <
> manuel.rodriguez.pascual at gmail.com> wrote:
> > Hi all,
> >
> > I am trying to configure MVAPICH and Slurm to work together, but I keep
> > running into problems. I feel it must be something pretty obvious, but I
> > just cannot find the issue. My final objective is for MVAPICH to use the
> > Slurm resource manager.
> >
> > I am using mvapich2-2.2 and slurm 15.08.12 (the same problem arises with
> > newer versions as well).
> >
> > As a first test, I compiled with:
> >
> > ./configure --prefix=/home/localsoft/mvapich2 --with-pm=mpirun:hydra
> > --disable-mcast
> >
> > This works OK, but it uses Hydra (which I don't want).
> >
> > However, when I compile following the MVAPICH manual, it does not work as
> > expected:
> >
> > ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
> --with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none
> -enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
> --enable-mpit-pvars=all --enable-check-compiler-flags
> --enable-threads=multiple --enable-weak-symbols  --enable-fast-install
> --enable-g=dbg --enable-error-messages=all --enable-error-checking=all
> >
> > mpicc  helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
> >
> > This works fine when I run jobs on a single node, e.g.:
> > -bash-4.2$ srun -n 2 --tasks-per-node=2 helloWorldMPISrun.sh
> > Process 0 of 2 is on acme11.ciemat.es
> > Process 1 of 2 is on acme11.ciemat.es
> > Hello world from process 0 of 2
> > Hello world from process 1 of 2
> >
> >
> > But when I try to run it on two nodes, it crashes. I have included
> > "export MV2_DEBUG_SHOW_BACKTRACE=1" in my script for debugging info:
> >
> >
> > -bash-4.2$ srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh
> > [acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error:
> Segmentation fault (signal 11)
> > [acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error:
> Segmentation fault (signal 11)
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   0:
> /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c)
> [0x7fbff14d529c]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   1:
> /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59)
> [0x7fbff14d5399]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   2:
> /usr/lib64/libc.so.6(+0x35670) [0x7fbff0bd2670]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   3:
> /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281)
> [0x7fbff14cda91]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   4:
> /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c)
> [0x7fbff14ce0ac]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   5:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f)
> [0x7fbff14a612f]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   6:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842)
> [0x7fbff148e3f2]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   7:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x[acme12.ciemat.es:
> mpi_rank_1][print_backtrace]   0: /home/localsoft/mvapich2/lib/
> libmpi.so.12(print_backtrace+0x1c) [0x7f6c0f62d29c]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   1:
> /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59)
> [0x7f6c0f62d399]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   2:
> /usr/lib64/libc.so.6(+0x35670) [0x7f6c0ed2a670]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   3:
> /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281)
> [0x7f6c0f625a91]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   4:
> /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c)
> [0x7f6c0f6260ac]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   5:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f)
> [0x7f6c0f5fe12f]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   6:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842)
> [0x7f6c0f5e63f2]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   7:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2)
> [0x7fbff1483b82]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   8:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9)
> [0x7fbff1403689]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   9:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7fbff14030e6]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  10: /home/slurm/tests/helloWorldMPI()
> [0x400881]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  11:
> /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbff0bbeb15]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  12: /home/slurm/tests/helloWorldMPI()
> [0x400789]
> > 1b2) [0x7f6c0f5dbb82]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   8:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9)
> [0x7f6c0f55b689]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   9:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7f6c0f55b0e6]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  10: /home/slurm/tests/helloWorldMPI()
> [0x400881]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  11:
> /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6c0ed16b15]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  12: /home/slurm/tests/helloWorldMPI()
> [0x400789]
> > helloWorldMPISrun.sh: line 5: 26539 Segmentation fault  (core dumped)
> > /home/slurm/tests/helloWorldMPI
> > helloWorldMPISrun.sh: line 5: 27743 Segmentation fault  (core dumped)
> > /home/slurm/tests/helloWorldMPI
> > srun: error: acme12: task 1: Exited with exit code 139
> > srun: error: acme11: task 0: Exited with exit code 139
> >
> >
> > In case it helps, config.log is attached.
> >
> > Any clue what's going on? If this is a version issue, I have no problem
> > downgrading my system to particular Slurm and MVAPICH versions known to
> > work together. However, I feel the problem is probably something else, as
> > I am quite new to this and am probably doing something obvious wrong.
> >
> > Thanks for your help,
> >
> >
> > Manuel
> >
> >
> >
>
> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
> || \\UTGERS      |---------------------*O*---------------------
> ||_// Biomedical | Ryan Novosielski - Senior Technologist
> || \\ and Health | novosirj at rutgers.edu - 973/972.0922 (2x0922)
> ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>      `'
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>