[mvapich-discuss] MVAPICH2 2.x with PMI2 and SLURM (was: Re: problem with MVAPICH+Slurm)

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Sun Apr 9 14:45:49 EDT 2017


Hi Ryan,

You can look at the following paper for more details on on-demand
connection management.

@inproceedings{wu2002impact,
  title={Impact of on-demand connection management in MPI over VIA},
  author={Wu, Jiesheng and Liu, Jiuxing and Wyckoff, Pete and Panda,
Dhabaleswar},
  booktitle={Proceedings of the 2002 IEEE International Conference on Cluster Computing},
  pages={152--159},
  year={2002},
  organization={IEEE}
}

All MVAPICH2 releases, including the "alpha" ones, are tested rigorously.
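
If you ever need to double-check how a particular MVAPICH2 build was
configured (for example, whether it was built against Slurm's PMI), the
mpiname utility shipped with MVAPICH2 prints the version and the configure
options used for the build:

  $ mpiname -a

The output should include the full configure line, so options such as
--with-pmi=slurm and --with-pm=none can be verified at a glance.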

Thanks,
Sourav



On Fri, Apr 7, 2017 at 12:54 PM, Ryan Novosielski <novosirj at rutgers.edu>
wrote:

> Thanks, Sourav. I’ve been looking for more information on what exactly
> on-demand connection setup means. Can you point me to anything that will be
> informative?
>
> Also, the version 2.3a — is the “a” indicative of alpha/pre-release (as in
> it’s not yet stable)?
>
> Thanks again.
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>     `'
>
> > On Apr 4, 2017, at 16:35, Sourav Chakraborty <
> chakraborty.52 at buckeyemail.osu.edu> wrote:
> >
> > Hello Ryan,
> >
> > This issue has been fixed in the latest release of MVAPICH2-2.3a.
> >
> > Setting MV2_ON_DEMAND_THRESHOLD=1 was suggested as a workaround. It
> enables on-demand connection setup for all process sizes and does not
> affect performance. You won't need to use the workaround with the latest
> release.
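> >
> > For reference, if you do need to apply the workaround on an older build, a
> > minimal sketch is to export the variable in the job script before launching
> > the tasks (the binary name below is just a placeholder):
> >
> >   export MV2_ON_DEMAND_THRESHOLD=1
> >   srun -n 2 --tasks-per-node=1 ./helloWorldMPI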
> >
> > Thanks,
> > Sourav
> >
> >
> > On Tue, Apr 4, 2017 at 3:16 PM, Ryan Novosielski <novosirj at rutgers.edu>
> wrote:
> > Hi there,
> >
> > This thread recently saved me after upgrading from SLURM 15.08.x (I don’t
> recall the exact version, maybe 15.08.4), built from the RPM spec provided
> in the SLURM tar file, to the OpenHPC-provided SLURM 16.05.5 build. I found
> that I got a segmentation fault without the parameter (I added it to
> /etc/mvapich2.conf on all of my compute nodes).
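> >
> > For anyone doing the same: as far as I can tell the file just takes one
> > parameter per line in NAME=VALUE form, so the entry is simply:
> >
> >   MV2_ON_DEMAND_THRESHOLD=1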
> >
> > From the earlier messages in this thread, this was a “try this”
> suggestion that was going to be investigated. Do we know why this is
> happening, and whether there is any negative impact to setting
> MV2_ON_DEMAND_THRESHOLD=1? I want to make sure I’m not setting myself up
> for a different problem down the road, and that there isn’t a better
> solution to this by now.
> >
> > Thanks!
> >
> > > On Dec 13, 2016, at 5:26 AM, Manuel Rodríguez Pascual <
> manuel.rodriguez.pascual at gmail.com> wrote:
> > >
> > > yes, it does work now :)
> > >
> > > Thanks very much for your help, Sourav
> > >
> > >
> > >
> > > 2016-12-12 22:22 GMT+01:00 Sourav Chakraborty <
> chakraborty.52 at buckeyemail.osu.edu>:
> > > Hi Manuel,
> > >
> > > Thanks for reporting the issue. We are investigating it.
> > >
> > > Can you please try setting "export MV2_ON_DEMAND_THRESHOLD=1" and see
> if it solves the issue?
> > >
> > > Thanks,
> > > Sourav
> > >
> > >
> > > On Mon, Dec 12, 2016 at 10:59 AM, Manuel Rodríguez Pascual <
> manuel.rodriguez.pascual at gmail.com> wrote:
> > > Hi all,
> > >
> > > I am trying to configure MVAPICH2 and Slurm to work together, but I keep
> running into problems. I feel it must be something pretty obvious, but I
> just cannot find the issue. My final objective is for MVAPICH2 to use the
> Slurm resource manager.
> > >
> > > I am using MVAPICH2 2.2 and Slurm 15.08.12 (although the same problem
> arises with newer versions).
> > >
> > > As a first test, compiling with:
> > >
> > > ./configure --prefix=/home/localsoft/mvapich2 --with-pm=mpirun:hydra --disable-mcast
> > >
> > > This works OK, but uses Hydra (which I don't want).
> > >
> > > However, when I compile following the MVAPICH2 manual, it does not work
> as expected:
> > >
> > > ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
> --with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none
> --enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
> --enable-mpit-pvars=all --enable-check-compiler-flags
> --enable-threads=multiple --enable-weak-symbols  --enable-fast-install
> --enable-g=dbg --enable-error-messages=all --enable-error-checking=all
> > >
> > > mpicc  helloWorldMPI.c -o helloWorldMPI -L/home/localsoft/slurm/lib
> > >
> > > This works fine when I run jobs on a single node, for example:
> > > -bash-4.2$ srun -n 2 --tasks-per-node=2 helloWorldMPISrun.sh
> > > Process 0 of 2 is on acme11.ciemat.es
> > > Process 1 of 2 is on acme11.ciemat.es
> > > Hello world from process 0 of 2
> > > Hello world from process 1 of 2
> > >
> > >
> > > But when I try to run it on two nodes, it crashes. I have included
> "export MV2_DEBUG_SHOW_BACKTRACE=1" in my script for debbuging info
> > >
> > >
> > > -bash-4.2$ srun -n 2 --tasks-per-node=1 helloWorldMPISrun.sh
> > > [acme12.ciemat.es:mpi_rank_1][error_sighandler] Caught error:
> Segmentation fault (signal 11)
> > > [acme11.ciemat.es:mpi_rank_0][error_sighandler] Caught error:
> Segmentation fault (signal 11)
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   0:
> /home/localsoft/mvapich2/lib/libmpi.so.12(print_backtrace+0x1c)
> [0x7fbff14d529c]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   1:
> /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59)
> [0x7fbff14d5399]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   2:
> /usr/lib64/libc.so.6(+0x35670) [0x7fbff0bd2670]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   3:
> /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281)
> [0x7fbff14cda91]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   4:
> /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c)
> [0x7fbff14ce0ac]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   5:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f)
> [0x7fbff14a612f]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   6:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842)
> [0x7fbff148e3f2]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   7:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x[acme12.ciemat.es:
> mpi_rank_1][print_backtrace]   0: /home/localsoft/mvapich2/lib/
> libmpi.so.12(print_backtrace+0x1c) [0x7f6c0f62d29c]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   1:
> /home/localsoft/mvapich2/lib/libmpi.so.12(error_sighandler+0x59)
> [0x7f6c0f62d399]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   2:
> /usr/lib64/libc.so.6(+0x35670) [0x7f6c0ed2a670]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   3:
> /home/localsoft/mvapich2/lib/libmpi.so.12(_ring_boot_exchange+0x281)
> [0x7f6c0f625a91]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   4:
> /home/localsoft/mvapich2/lib/libmpi.so.12(rdma_ring_boot_exchange+0x7c)
> [0x7f6c0f6260ac]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   5:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3I_RDMA_init+0x78f)
> [0x7f6c0f5fe12f]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   6:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIDI_CH3_Init+0x842)
> [0x7f6c0f5e63f2]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   7:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPID_Init+0x1b2)
> [0x7fbff1483b82]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   8:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9)
> [0x7fbff1403689]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]   9:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7fbff14030e6]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  10: /home/slurm/tests/helloWorldMPI()
> [0x400881]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  11:
> /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbff0bbeb15]
> > > [acme11.ciemat.es:mpi_rank_0][print_backtrace]  12: /home/slurm/tests/helloWorldMPI()
> [0x400789]
> > > 1b2) [0x7f6c0f5dbb82]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   8:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPIR_Init_thread+0x2b9)
> [0x7f6c0f55b689]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]   9:
> /home/localsoft/mvapich2/lib/libmpi.so.12(MPI_Init+0x86) [0x7f6c0f55b0e6]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  10: /home/slurm/tests/helloWorldMPI()
> [0x400881]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  11:
> /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6c0ed16b15]
> > > [acme12.ciemat.es:mpi_rank_1][print_backtrace]  12: /home/slurm/tests/helloWorldMPI()
> [0x400789]
> > > helloWorldMPISrun.sh: línea 5: 26539 Violación de segmento  (`core'
> generado) /home/slurm/tests/helloWorldMPI                  <-------- this
> means "Segmentation fault (core dumped)"
> > > helloWorldMPISrun.sh: línea 5: 27743 Violación de segmento  (`core'
> generado) /home/slurm/tests/helloWorldMPI
> > > srun: error: acme12: task 1: Exited with exit code 139
> > > srun: error: acme11: task 0: Exited with exit code 139
> > >
> > >
> > > In case that it helps, config.log is attached.
> > >
> > > Any clue as to what's going on? Also, if there is a version issue, I have
> no problem downgrading my system to particular Slurm and MVAPICH2 versions
> known to work together. However, I suspect the problem is something else,
> as I am quite new to this and am probably doing something obviously wrong.
> > >
> > > Thanks for your help,
> > >
> > >
> > > Manuel
> > >
> > >
> > >
> >
> > ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
> > || \\UTGERS      |---------------------*O*---------------------
> > ||_// Biomedical | Ryan Novosielski - Senior Technologist
> > || \\ and Health | novosirj at rutgers.edu - 973/972.0922 (2x0922)
> > ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
> >      `'
> >
> >
> >
> >
>
>

