[mvapich-discuss] Strange behavior of MPI_Bcast in MVAPICH2 2.0 GDR

Akshay Venkatesh venkatesh.19 at buckeyemail.osu.edu
Mon Oct 20 14:02:10 EDT 2014


Hi Jens,

Thanks for the report. We're taking a look at this now and will get back
to you soon.

On Sun, Oct 19, 2014 at 6:32 PM, Jens Glaser <jsglaser at umich.edu> wrote:

> Dear developers,
>
> I ran into strange behavior of MPI_Bcast in MVAPICH2 2.0 GDR when the job
> runs on a single node only (-np 20, 2x 10-core E5-2690). The program
> crashes with a segfault inside an MPI_Bcast (of an unsigned int from rank
> 0). Note that the unsigned int is a stack variable (not a GPU buffer).
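>
> The call pattern is essentially the following minimal sketch (the variable
> name and value are illustrative; the actual call goes through HOOMD's
> templated bcast<unsigned int> helper visible in the backtrace below):
>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>
>         int rank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         // plain host stack variable, not a GPU buffer
>         unsigned int value = 0;
>         if (rank == 0)
>             value = 42;
>
>         // this broadcast is where the crash reported below occurs
>         // (with -np 20 on a single node)
>         MPI_Bcast(&value, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
>
>         MPI_Finalize();
>         return 0;
>     }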
>
> Here’s the output (rank 0 only)
> [ivb115:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [ivb115:mpi_rank_0][print_backtrace]   0:
> /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(print_backtrace+0x23)
> [0x7fdb16bcaab3]
> [ivb115:mpi_rank_0][print_backtrace]   1:
> /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(error_sighandler+0x5e)
> [0x7fdb16bcabce]
> [ivb115:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0(+0xf710)
> [0x7fdb1a5fb710]
> [ivb115:mpi_rank_0][print_backtrace]   3:
> /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x1c0)
> [0x7fdb16c534f0]
> [ivb115:mpi_rank_0][print_backtrace]   4:
> /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(MPIR_Bcast_MV2+0xdf)
> [0x7fdb16c517bf]
> [ivb115:mpi_rank_0][print_backtrace]   5:
> /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(MPIR_Bcast_impl+0x1b)
> [0x7fdb16bf263b]
> [ivb115:mpi_rank_0][print_backtrace]   6:
> /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(MPI_Bcast+0x4d7)
> [0x7fdb16bf2bf7]
> [ivb115:mpi_rank_0][print_backtrace]   7:
> /home-2/jglaser/hoomd-install/bin/../lib/hoomd/python-module/hoomd.so(_Z5bcastIjEvRT_ji+0x97)
> [0x7fdb1c307997]
> [ivb115:mpi_rank_0][print_backtrace]   8:
> /home-2/jglaser/hoomd-install/bin/../lib/hoomd/python-module/hoomd.so(_ZN19DomainDecompositionC2EN5boost10shared_ptrI22ExecutionConfigurationEE6float3jjjb+0xc6e)
> [0x7fdb1c302b7e]
> [ivb115:mpi_rank_0][print_backtrace]   9:
> /home-2/jglaser/hoomd-install/bin/../lib/hoomd/python-module/hoomd.so(_ZN5boost6python7objects11make_holderILi6EE5applyINS1_14pointer_holderINS_10shared_ptrI19DomainDecompositionEES7_EENS_3mpl7vector6INS6_I22ExecutionConfigurationEE6float3jjjbEEE7executeEP7_objectSD_SE_jjjb+0xd3)
> [0x7fdb1c3091e3]
> [ivb115:mpi_rank_0][print_backtrace]  10:
> /home-2/jglaser/hoomd-install/bin/../lib/hoomd/python-module/hoomd.so(_ZN5boost6python7objects23caller_py_function_implINS0_6detail6callerIPFvP7_objectNS_10shared_ptrI22ExecutionConfigurationEE6float3jjjbENS0_21default_call_policiesENS_3mpl7vector8IvS6_S9_SA_jjjbEEEEEclES6_S6_+0x308)
> [0x7fdb1c304c08]
> [ivb115:mpi_rank_0][print_backtrace]  11:
> /home-2/jglaser/local/lib/libboost_python3.so.1.55.0(_ZNK5boost6python7objects8function4callEP7_objectS4_+0xca)
> [0x7fdb1ac9f77a]
> etc.
>
> The program runs fine on more than 20 ranks (i.e. inter-node). It also
> runs fine on a single node if I set MV2_USE_SHMEM_COLL=0 to disable the
> intra-node shared-memory collective optimizations.
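>
> For example, a launch line along these lines avoids the crash (the
> hostfile name and executable are placeholders):
>
>     mpirun_rsh -np 20 -hostfile hosts MV2_USE_SHMEM_COLL=0 ./my_app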
>
> Please let me know if you need any more information.
>
> best
> Jens
>
> P.S.: Version information:
>
> MVAPICH2 Version:     2.0
> MVAPICH2 Release date: Fri Jun 20 20:00:00 EDT 2014
> MVAPICH2 Device:      ch3:mrail
> MVAPICH2 configure:   --build=x86_64-unknown-linux-gnu
> --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu
> --program-prefix= --prefix=/opt/mvapich2/gdr/2.0/gnu
> --exec-prefix=/opt/mvapich2/gdr/2.0/gnu
> --bindir=/opt/mvapich2/gdr/2.0/gnu/bin
> --sbindir=/opt/mvapich2/gdr/2.0/gnu/sbin
> --sysconfdir=/opt/mvapich2/gdr/2.0/gnu/etc
> --datadir=/opt/mvapich2/gdr/2.0/gnu/share
> --includedir=/opt/mvapich2/gdr/2.0/gnu/include
> --libdir=/opt/mvapich2/gdr/2.0/gnu/lib64
> --libexecdir=/opt/mvapich2/gdr/2.0/gnu/libexec --localstatedir=/var
> --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/2.0/gnu/share/man
> --infodir=/opt/mvapich2/gdr/2.0/gnu/share/info --disable-rpath
> --disable-static --enable-shared --disable-rdma-cm --disable-mcast
> --enable-cuda --without-hydra-ckpointlib CPPFLAGS=-I/usr/local/cuda/include
> LDFLAGS=-L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64
> -Wl,-rpath,XORIGIN/placeholder
> MVAPICH2 CC:  gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic   -DNDEBUG
> -DNVALGRIND -O2
> MVAPICH2 CXX: g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic  -DNDEBUG
> -DNVALGRIND
> MVAPICH2 F77: gfortran -L/lib -L/lib -O2 -g -pipe -Wall
> -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> --param=ssp-buffer-size=4 -m64 -mtune=generic
> -I/opt/mvapich2/gdr/2.0/gnu/lib64/gfortran/modules  -O2
> MVAPICH2 FC:  gfortran
>


-- 
- Akshay

http://www.cse.ohio-state.edu/~akshay