[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband

Hajime Fujita hfujita at uchicago.edu
Sun Mar 30 16:20:19 EDT 2014


Hi Hari,

While the previous sample (mpimbench.c) worked well with 
MV2_NDREG_ENTRIES=2048, I found another example (random.c; see the 
attached file) for which the environment variable did not help.

I tried values of MV2_NDREG_ENTRIES up to 131072, but none of them 
worked. Do you have any other suggestions? I really appreciate your help.

This program works fine if it is:
a) run on a single host, or
b) run with MVAPICH2-2.0b, even over InfiniBand.
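
For readers who don't have the attachment handy: roughly speaking, the 
"acc" mode of random.c issues MPI-3 accumulate operations into an RMA 
window. The sketch below is only a simplified stand-in for that pattern, 
not the attached file itself; the window size, iteration count, and 
flush interval here are placeholders.

/* Two-rank sketch: rank 0 accumulates into random offsets of a window
 * exposed by rank 1. Illustrative only; sizes and counts are assumed. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WIN_DOUBLES (1 << 20)   /* assumed window size: 1 Mi doubles */
#define NITER       100000      /* assumed number of accumulates */

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    double *base, val = 1.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Let MPI allocate (and register) the window memory. */
    MPI_Win_allocate(WIN_DOUBLES * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    MPI_Win_lock_all(0, win);
    if (rank == 0) {
        for (i = 0; i < NITER; i++) {
            MPI_Aint disp = rand() % WIN_DOUBLES;
            MPI_Accumulate(&val, 1, MPI_DOUBLE, 1, disp, 1, MPI_DOUBLE,
                           MPI_SUM, win);
            if (i % 1024 == 0)
                MPI_Win_flush(1, win);   /* complete pending operations */
        }
        MPI_Win_flush(1, win);
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}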


[hfujita at midway-login2 mpimbench]$ MV2_NDREG_ENTRIES=2048 mpiexec -n 2 
-hosts midway-login1,midway-login2 ./random acc
[midway-login2:mpi_rank_1][error_sighandler] Caught error: Segmentation 
fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 18343 RUNNING AT midway-login2
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at midway-login1] HYD_pmcd_pmip_control_cmd_cb 
(pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:0 at midway-login1] HYDT_dmxu_poll_wait_for_event 
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at midway-login1] main (pm/pmiserv/pmip.c:206): demux engine 
error waiting for event
[mpiexec at midway-login2] HYDT_bscu_wait_for_completion 
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated 
badly; aborting
[mpiexec at midway-login2] HYDT_bsci_wait_for_completion 
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting 
for completion
[mpiexec at midway-login2] HYD_pmci_wait_for_completion 
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for 
completion
[mpiexec at midway-login2] main (ui/mpich/mpiexec.c:336): process manager 
error waiting for completion


Thanks,
Hajime

On 03/29/2014 10:19 AM, Hari Subramoni wrote:
> Hello Hajime,
>
> This is not a bug in the RMA design in MVAPICH2. The application is
> running out of memory that can be registered with the IB HCA. Could you
> please try running your application with the environment variable
> MV2_NDREG_ENTRIES=2048 set?
>
> Regards,
> Hari.
>
>
> On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita <hfujita at uchicago.edu
> <mailto:hfujita at uchicago.edu>> wrote:
>
>     Dear MVAPICH team,
>
>     I was glad to hear about the release of MVAPICH2-2.0rc1 and tried it
>     immediately. I then found that my MPI-3 RMA program started crashing.
>
>     The attached simple program is enough to reproduce the issue. Here's
>     the output:
>
>     [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host
>     midway-login1,midway-login2 ./mpimbench
>     Message-based ping pong
>     4, 1.272331
>     8, 0.620984
>     16, 0.323668
>     32, 0.221903
>     64, 0.076136
>     128, 0.033388
>     256, 0.016455
>     512, 0.007715
>     1024, 0.004121
>     2048, 0.002435
>     4096, 0.002345
>     8192, 0.002069
>     16384, 0.002067
>     32768, 0.006494
>     65536, 0.001325
>     131072, 0.000686
>     262144, 0.000491
>     524288, 0.000423
>     1048576, 0.000395
>     RMA-based put
>     16, 0.491239
>     32, 0.299855
>     64, 0.155028
>     128, 0.078400
>     256, 0.040418
>     512, 0.020406
>     1024, 0.009608
>     2048, 0.004888
>     4096, 0.002399
>     8192, 0.002702
>     [midway-login1:mpi_rank_0][error_sighandler] Caught error:
>     Segmentation fault (signal 11)
>
>     ===================================================================================
>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>     =   PID 9519 RUNNING AT midway-login1
>     =   EXIT CODE: 11
>     =   CLEANING UP REMAINING PROCESSES
>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>     ===================================================================================
>     [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
>     (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>     [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
>     (tools/demux/demux_poll.c:76): callback returned error status
>     [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206): demux engine
>     error waiting for event
>     [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
>     (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>     terminated badly; aborting
>     [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
>     (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>     waiting for completion
>     [mpiexec at midway-login1] HYD_pmci_wait_for_completion
>     (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
>     for completion
>     [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process
>     manager error waiting for completion
>
>
>     This run was done on the UChicago Midway Cluster.
>     http://rcc.uchicago.edu/resources/midway_specs.html
>
>     One observation is that this issue happens only when I use
>     InfiniBand for communication. If I launch the same program on a
>     single node, it finishes successfully.
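>
>     For reference, the "RMA-based put" phase above is a loop of MPI_Put
>     operations with the transfer size doubling each round. The sketch
>     below is only a simplified stand-in for that pattern, not the
>     attached mpimbench.c itself; the buffer size, repetition count, and
>     the lock/unlock synchronization here are placeholders.
>
>     /* Two-rank sketch: rank 0 puts increasingly large messages into a
>      * window on rank 1. Illustrative only; sizes/counts are assumed. */
>     #include <mpi.h>
>     #include <string.h>
>
>     #define MAX_BYTES (1 << 20)   /* assumed maximum transfer: 1 MiB */
>     #define NITER     1000        /* assumed repetitions per size */
>
>     int main(int argc, char **argv)
>     {
>         int rank, size, i;
>         char *buf;
>         MPI_Win win;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         /* Let MPI allocate (and register) the window memory. */
>         MPI_Win_allocate(MAX_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
>                          &buf, &win);
>         memset(buf, 0, MAX_BYTES);
>
>         for (size = 16; size <= MAX_BYTES; size *= 2) {
>             MPI_Barrier(MPI_COMM_WORLD);
>             if (rank == 0) {
>                 for (i = 0; i < NITER; i++) {
>                     /* Put 'size' bytes into rank 1's window and close
>                        the epoch so the transfer completes. */
>                     MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
>                     MPI_Put(buf, size, MPI_BYTE, 1, 0, size, MPI_BYTE,
>                             win);
>                     MPI_Win_unlock(1, win);
>                 }
>             }
>             /* Timing and output omitted; the run above died right
>                after the 8192-byte step. */
>         }
>
>         MPI_Win_free(&win);
>         MPI_Finalize();
>         return 0;
>     }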
>
>     And here's the output of the mpichversion command:
>     [hfujita at midway-login1 mpimbench]$ mpichversion
>     MVAPICH2 Version:       2.0rc1
>     MVAPICH2 Release date:  Sun Mar 23 21:35:26 EDT 2014
>     MVAPICH2 Device:        ch3:mrail
>     MVAPICH2 configure:     --disable-option-checking
>     --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8
>     --enable-shared --disable-checkerrors --cache-file=/dev/null
>     --srcdir=. CC=gcc CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib
>     -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib LIBS=-libmad -libumad
>     -libverbs -lrt -lhwloc -lpthread -lhwloc
>     CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
>     -I/include --with-cross=src/mpid/pamid/cross/bgq8
>     --enable-threads=multiple
>     MVAPICH2 CC:    gcc -DNDEBUG -DNVALGRIND -O2   -DNDEBUG -DNVALGRIND -O2
>     MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND
>     MVAPICH2 F77:   gfortran   -O2
>     MVAPICH2 FC:    gfortran
>
>     If you need more explanation or information, please let me know.
>
>
>     Thanks,
>     Hajime
>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: random.c
Type: text/x-csrc
Size: 3104 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20140330/258284dd/attachment.bin>

