[mvapich-discuss] Strange behavior of MPI_Bcast in MVAPICH2 2.0 GDR

Jens Glaser jsglaser at umich.edu
Sun Oct 19 18:32:43 EDT 2014


Dear developers,

I ran into strange behavior of MPI_Bcast in MVAPICH2 2.0 GDR when a job runs on a single node only
(-np 20, 2x 10-core E5-2690). The program crashes with a segmentation fault inside an MPI_Bcast
of an unsigned int broadcast from rank 0. Note that the unsigned int is a stack variable (not a GPU buffer).
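For reference, the call pattern boils down to something like the following minimal sketch (placeholder code, not the actual HOOMD call site):

// Minimal sketch of the failing pattern: broadcast a single unsigned int
// that lives on the stack of each rank, with rank 0 as the root.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    unsigned int value = 0;      // host/stack variable, not a GPU buffer
    if (rank == 0)
        value = 42;              // arbitrary payload chosen by the root

    // The segfault is reported from inside this call when running
    // with -np 20 on a single node.
    MPI_Bcast(&value, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD);

    printf("rank %d received %u\n", rank, value);
    MPI_Finalize();
    return 0;
}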

Here’s the output (rank 0 only):
[ivb115:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[ivb115:mpi_rank_0][print_backtrace]   0: /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(print_backtrace+0x23) [0x7fdb16bcaab3]
[ivb115:mpi_rank_0][print_backtrace]   1: /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(error_sighandler+0x5e) [0x7fdb16bcabce]
[ivb115:mpi_rank_0][print_backtrace]   2: /lib64/libpthread.so.0(+0xf710) [0x7fdb1a5fb710]
[ivb115:mpi_rank_0][print_backtrace]   3: /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(MPIR_Bcast_index_tuned_intra_MV2+0x1c0) [0x7fdb16c534f0]
[ivb115:mpi_rank_0][print_backtrace]   4: /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(MPIR_Bcast_MV2+0xdf) [0x7fdb16c517bf]
[ivb115:mpi_rank_0][print_backtrace]   5: /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(MPIR_Bcast_impl+0x1b) [0x7fdb16bf263b]
[ivb115:mpi_rank_0][print_backtrace]   6: /shared/apps/rhel-6.2/mpi/gnu/mvapich2-gdr-2.0/cuda6.5/lib64/libmpich.so.12(MPI_Bcast+0x4d7) [0x7fdb16bf2bf7]
[ivb115:mpi_rank_0][print_backtrace]   7: /home-2/jglaser/hoomd-install/bin/../lib/hoomd/python-module/hoomd.so(_Z5bcastIjEvRT_ji+0x97) [0x7fdb1c307997]
[ivb115:mpi_rank_0][print_backtrace]   8: /home-2/jglaser/hoomd-install/bin/../lib/hoomd/python-module/hoomd.so(_ZN19DomainDecompositionC2EN5boost10shared_ptrI22ExecutionConfigurationEE6float3jjjb+0xc6e) [0x7fdb1c302b7e]
[ivb115:mpi_rank_0][print_backtrace]   9: /home-2/jglaser/hoomd-install/bin/../lib/hoomd/python-module/hoomd.so(_ZN5boost6python7objects11make_holderILi6EE5applyINS1_14pointer_holderINS_10shared_ptrI19DomainDecompositionEES7_EENS_3mpl7vector6INS6_I22ExecutionConfigurationEE6float3jjjbEEE7executeEP7_objectSD_SE_jjjb+0xd3) [0x7fdb1c3091e3]
[ivb115:mpi_rank_0][print_backtrace]  10: /home-2/jglaser/hoomd-install/bin/../lib/hoomd/python-module/hoomd.so(_ZN5boost6python7objects23caller_py_function_implINS0_6detail6callerIPFvP7_objectNS_10shared_ptrI22ExecutionConfigurationEE6float3jjjbENS0_21default_call_policiesENS_3mpl7vector8IvS6_S9_SA_jjjbEEEEEclES6_S6_+0x308) [0x7fdb1c304c08]
[ivb115:mpi_rank_0][print_backtrace]  11: /home-2/jglaser/local/lib/libboost_python3.so.1.55.0(_ZNK5boost6python7objects8function4callEP7_objectS4_+0xca) [0x7fdb1ac9f77a]
etc.

The program runs fine on more than 20 ranks (i.e., inter-node). It also runs fine on a single node
if I set MV2_USE_SHMEM_COLL=0 to disable the intra-node shared-memory collective optimizations.
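For completeness, the workaround is applied like this when launching the sketch above (assuming the bundled mpiexec/hydra launcher, which inherits the environment; with mpirun_rsh the variable can instead be passed on the command line):

export MV2_USE_SHMEM_COLL=0
mpiexec -np 20 ./bcast_test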

Please let me know if you need any more information.

best
Jens

P.S.: Version information:

MVAPICH2 Version:     	2.0
MVAPICH2 Release date:	Fri Jun 20 20:00:00 EDT 2014
MVAPICH2 Device:      	ch3:mrail
MVAPICH2 configure:   	--build=x86_64-unknown-linux-gnu --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu --program-prefix= --prefix=/opt/mvapich2/gdr/2.0/gnu --exec-prefix=/opt/mvapich2/gdr/2.0/gnu --bindir=/opt/mvapich2/gdr/2.0/gnu/bin --sbindir=/opt/mvapich2/gdr/2.0/gnu/sbin --sysconfdir=/opt/mvapich2/gdr/2.0/gnu/etc --datadir=/opt/mvapich2/gdr/2.0/gnu/share --includedir=/opt/mvapich2/gdr/2.0/gnu/include --libdir=/opt/mvapich2/gdr/2.0/gnu/lib64 --libexecdir=/opt/mvapich2/gdr/2.0/gnu/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2/gdr/2.0/gnu/share/man --infodir=/opt/mvapich2/gdr/2.0/gnu/share/info --disable-rpath --disable-static --enable-shared --disable-rdma-cm --disable-mcast --enable-cuda --without-hydra-ckpointlib CPPFLAGS=-I/usr/local/cuda/include LDFLAGS=-L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64 -Wl,-rpath,XORIGIN/placeholder
MVAPICH2 CC:  	gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: 	g++ -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic  -DNDEBUG -DNVALGRIND
MVAPICH2 F77: 	gfortran -L/lib -L/lib -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -I/opt/mvapich2/gdr/2.0/gnu/lib64/gfortran/modules  -O2
MVAPICH2 FC:  	gfortran  