[mvapich-discuss] MVAPICH2 2.x with PMI2 and SLURM (was: Re: problem with MVAPICH+Slurm)

Ryan Novosielski novosirj at rutgers.edu
Fri Apr 7 12:54:03 EDT 2017


Thanks, Sourav. I’ve been looking for more information on what exactly on-demand connection setup means. Can you point me to anything that will be informative? 

Also, the version 2.3a — is the “a” indicative of alpha/pre-release (as in it’s not yet stable)?

Thanks again.

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
    `'

> On Apr 4, 2017, at 16:35, Sourav Chakraborty <chakraborty.52 at buckeyemail.osu.edu> wrote:
> 
> Hello Ryan,
> 
> This issue has been fixed in the latest release of MVAPICH2-2.3a.
> 
> Setting MV2_ON_DEMAND_THRESHOLD=1 was suggested as a workaround. It enables on-demand connection setup for all process counts and does not affect performance. You won't need the workaround with the latest release.
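> 
> For reference, a rough sketch of the two usual ways to apply the workaround (treat this as an example, not an exact recipe):
> 
> # Option 1: per job, in the shell or batch script before srun
> export MV2_ON_DEMAND_THRESHOLD=1
> 
> # Option 2: system wide, by adding the same setting to /etc/mvapich2.conf
> # on each compute node (see the user guide for the exact file format)
> MV2_ON_DEMAND_THRESHOLD=1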
> 
> Thanks,
> Sourav
> 
> 
> On Tue, Apr 4, 2017 at 3:16 PM, Ryan Novosielski <novosirj at rutgers.edu> wrote:
> Hi there,
> 
> This thread recently saved me after upgrading from SLURM 15.08.x (I don’t recall the exact version, maybe 15.08.4), built from the RPM spec provided in the SLURM tar file, to the OpenHPC-provided SLURM 16.05.5 build. Without the parameter I got a segmentation fault (I added it to /etc/mvapich2.conf on all of my compute nodes).
> 
> From the earlier thread, this was suggested as a “try this” workaround that was going to be investigated. Do we know why it happens, and whether setting MV2_ON_DEMAND_THRESHOLD=1 has any negative impact? I want to make sure I’m not setting myself up for a different problem down the road, and that there isn’t a better solution by now.
> 
> Thanks!
> 
> > On Dec 13, 2016, at 5:26 AM, Manuel Rodríguez Pascual <manuel.rodriguez.pascual at gmail.com> wrote:
> >
> > yes, it does work now :)
> >
> > Thanks very much for your help, Sourav
> >
> >
> >
> > 2016-12-12 22:22 GMT+01:00 Sourav Chakraborty <chakraborty.52 at buckeyemail.osu.edu>:
> > Hi Manuel,
> >
> > Thanks for reporting the issue. We are investigating it.
> >
> > Can you please try setting "export MV2_ON_DEMAND_THRESHOLD=1" and see if it solves the issue?
> >
> > Thanks,
> > Sourav
> >
> >
> > On Mon, Dec 12, 2016 at 10:59 AM, Manuel Rodríguez Pascual <manuel.rodriguez.pascual at gmail.com> wrote:
> > Hi all,
> >
> > I am trying to configure MVAPICH2 and Slurm to work together, but I keep running into problems. I feel it must be something pretty obvious, but I just cannot find the issue. My final objective is for MVAPICH2 to use the Slurm resource manager.
> >
> > I am using MVAPICH2 2.2 and Slurm 15.08.12 (although the same problem arises with newer versions).
> >
> > As a first test, I compiled with:
> >
> > ./configure --prefix=/home/localsoft/mvapich2 --with-pm=mpirun:hydra --disable-mcast
> >
> > This works OK, but it uses Hydra (which I don't want to use).
> >
> > However, when I compile following the MVAPICH2 manual, it does not work as expected:
> >
> > ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast --with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none --enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo --enable-mpit-pvars=all --enable-check-compiler-flags --enable-threads=multiple --enable-weak-symbols  --enable-fast-install --enable-g=dbg --enable-error-messages=all --enable-error-checking=all
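> >
> > After configure, I do the usual build and install before compiling the test program, roughly (the PATH export is only so that the mpicc below picks up the newly built installation):
> >
> > make
> > make install
> > export PATH=/home/localsoft/mvapich2/bin:$PATH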
> >
> > mpicc  helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
> >
> > This works fine when I run jobs on a single node, for example:
> > -bash-4.2$ srun -n 2 --tasks-per-node=2 helloWorldMPISrun.sh
> > Process 0 of 2 is on acme11.ciemat.es
> > Process 1 of 2 is on acme11.ciemat.es
> > Hello world from process 0 of 2
> > Hello world from process 1 of 2
> >
> >
> > But when I try to run it on two nodes, it crashes. I have included "export MV2_DEBUG_SHOW_BACKTRACE=1" in my script to get debugging info:
> >
> >
> > -bash-4.2$ srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh
> > [acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
> > [acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   0: /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7fbff14d529c]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   1: /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59) [0x7fbff14d5399]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   2: /usr/lib64/libc.so.6(+0x35670) [0x7fbff0bd2670]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   3: /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281) [0x7fbff14cda91]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   4: /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c) [0x7fbff14ce0ac]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   5: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f) [0x7fbff14a612f]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   6: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842) [0x7fbff148e3f2]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   7: /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7fbff1483b82]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   8: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9) [0x7fbff1403689]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   9: /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7fbff14030e6]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  10: /home/slurm/tests/helloWorldMPI() [0x400881]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  11: /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbff0bbeb15]
> > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  12: /home/slurm/tests/helloWorldMPI() [0x400789]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   0: /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c) [0x7f6c0f62d29c]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   1: /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59) [0x7f6c0f62d399]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   2: /usr/lib64/libc.so.6(+0x35670) [0x7f6c0ed2a670]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   3: /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281) [0x7f6c0f625a91]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   4: /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c) [0x7f6c0f6260ac]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   5: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f) [0x7f6c0f5fe12f]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   6: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842) [0x7f6c0f5e63f2]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   7: /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2) [0x7f6c0f5dbb82]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   8: /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9) [0x7f6c0f55b689]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   9: /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7f6c0f55b0e6]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  10: /home/slurm/tests/helloWorldMPI() [0x400881]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  11: /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6c0ed16b15]
> > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  12: /home/slurm/tests/helloWorldMPI() [0x400789]
> > helloWorldMPISrun.sh: line 5: 26539 Segmentation fault  (core dumped) /home/slurm/tests/helloWorldMPI
> > helloWorldMPISrun.sh: line 5: 27743 Segmentation fault  (core dumped) /home/slurm/tests/helloWorldMPI
> > srun: error: acme12: task 1: Exited with exit code 139
> > srun: error: acme11: task 0: Exited with exit code 139
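> >
> > If it is useful, I can also send the output of the following checks (mpiname is the small utility installed in the MVAPICH2 bin directory; the paths are from my setup):
> >
> > srun --mpi=list                                      # PMI plugins offered by this Slurm build
> > ldd /home/slurm/tests/helloWorldMPI | grep -i pmi    # which PMI library the binary links against
> > /home/localsoft/mvapich2/bin/mpiname -a              # version and configure options of this MVAPICH2 build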
> >
> >
> > In case it helps, config.log is attached.
> >
> > Any clue as to what's going on? If it is a version issue, I have no problem downgrading my system to particular Slurm and MVAPICH2 versions that are known to work together. However, I suspect the problem is something else, as I am quite new to this and am probably doing something obvious wrong.
> >
> > Thanks for your help,
> >
> >
> > Manuel
> >
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> 
> ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
> || \\UTGERS      |---------------------*O*---------------------
> ||_// Biomedical | Ryan Novosielski - Senior Technologist
> || \\ and Health | novosirj at rutgers.edu - 973/972.0922 (2x0922)
> ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
>      `'
> 
> 



