[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband

Hajime Fujita hfujita at uchicago.edu
Tue Mar 25 14:35:06 EDT 2014


Dear MVAPICH team,

I was glad to hear of the release of MVAPICH2-2.0rc1 and tried it
immediately. However, I found that my MPI-3 RMA program started
crashing with it.

The attached simple program is enough to reproduce the issue. Here's the 
output:

[hfujita@midway-login1 mpimbench]$ mpiexec -n 2 -host midway-login1,midway-login2 ./mpimbench
Message-based ping pong
4, 1.272331
8, 0.620984
16, 0.323668
32, 0.221903
64, 0.076136
128, 0.033388
256, 0.016455
512, 0.007715
1024, 0.004121
2048, 0.002435
4096, 0.002345
8192, 0.002069
16384, 0.002067
32768, 0.006494
65536, 0.001325
131072, 0.000686
262144, 0.000491
524288, 0.000423
1048576, 0.000395
RMA-based put
16, 0.491239
32, 0.299855
64, 0.155028
128, 0.078400
256, 0.040418
512, 0.020406
1024, 0.009608
2048, 0.004888
4096, 0.002399
8192, 0.002702
[midway-login1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 9519 RUNNING AT midway-login1
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@midway-login2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:1@midway-login2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@midway-login2] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@midway-login1] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@midway-login1] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@midway-login1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@midway-login1] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
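
As a rough aid for reading the output above, the following Python sketch
pulls out the last message size each benchmark section printed before
the crash. It is not part of mpimbench; the function name
last_completed_sizes is hypothetical. It also assumes that a "size, time"
line is printed only after that size completes and that the sizes keep
doubling, so the size after the last printed one is only a guess at
where the segmentation fault occurred.

```python
import re

# Section titles exactly as they appear in the pasted output.
SECTIONS = ("Message-based ping pong", "RMA-based put")


def last_completed_sizes(log: str) -> dict:
    """Map each section title to the last message size printed under it."""
    result = {}
    section = None
    for raw in log.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"(\d+), (\d+\.\d+)", line)
        if match and section is not None:
            result[section] = int(match.group(1))
        elif line in SECTIONS:
            section = line
    return result


# Shortened excerpt of the output quoted in this message.
sample = """Message-based ping pong
524288, 0.000423
1048576, 0.000395
RMA-based put
4096, 0.002399
8192, 0.002702
[midway-login1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
"""

last_put = last_completed_sizes(sample)["RMA-based put"]
# Assumption: sizes double each step, as they do in the lines printed so far.
suspect_size = last_put * 2
print(f"last completed put size: {last_put}, suspected failing size: {suspect_size}")
```

Under those assumptions, the put test completed through 8192 and the
suspected failing size is the next doubling step.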


This run was done on the UChicago Midway Cluster.
http://rcc.uchicago.edu/resources/midway_specs.html

One observation is that this issue happens only when I use InfiniBand
for communication, i.e. when the two processes run on different nodes.
If I launch the same program on a single node, it finishes successfully.

And here's the output of the mpichversion command:
[hfujita@midway-login1 mpimbench]$ mpichversion
MVAPICH2 Version:     	2.0rc1
MVAPICH2 Release date:	Sun Mar 23 21:35:26 EDT 2014
MVAPICH2 Device:      	ch3:mrail
MVAPICH2 configure:   	--disable-option-checking 
--prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8 --enable-shared 
--disable-checkerrors --cache-file=/dev/null --srcdir=. CC=gcc 
CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib -Wl,-rpath,/lib -L/lib 
-Wl,-rpath,/lib LIBS=-libmad -libumad -libverbs -lrt -lhwloc -lpthread 
-lhwloc 
CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src 
-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include 
-I/include --with-cross=src/mpid/pamid/cross/bgq8 --enable-threads=multiple
MVAPICH2 CC:  	gcc -DNDEBUG -DNVALGRIND -O2   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: 	g++   -DNDEBUG -DNVALGRIND
MVAPICH2 F77: 	gfortran   -O2
MVAPICH2 FC:  	gfortran

If you need more explanation or information, please let me know.


Thanks,
Hajime
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpimbench.c
Type: text/x-csrc
Size: 5175 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20140325/6d4656ca/attachment.bin>

