[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband
Hajime Fujita
hfujita at uchicago.edu
Sun Mar 30 16:20:19 EDT 2014
Hi Hari,
While the previous sample (mpimbench.c) worked well with
MV2_NDREG_ENTRIES=2048, I found another example (random.c; see the
attached file) for which the environment variable did not help. I
tried values of MV2_NDREG_ENTRIES up to 131072, but none of them
worked. Do you have any other suggestions? I really appreciate your
help.
This program works fine if it is:
a) run on a single host, or
b) run with MVAPICH2-2.0b, even over InfiniBand
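
For context, below is a minimal sketch of the kind of access pattern I
mean: window allocation followed by MPI_Accumulate calls at random
target offsets. This is only a hypothetical reconstruction for
illustration, not the attached random.c itself.

/* Hypothetical sketch (not the attached random.c): each rank
 * accumulates into random offsets of a window on the next rank. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Aint n = 1 << 20;            /* 1M doubles per window */
    double *base;
    MPI_Win win;
    MPI_Win_allocate(n * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    int target = (rank + 1) % nprocs;
    double val = 1.0;
    MPI_Win_lock_all(0, win);
    for (int i = 0; i < 100000; i++) {
        MPI_Aint disp = rand() % n;        /* random target offset */
        MPI_Accumulate(&val, 1, MPI_DOUBLE, target, disp,
                       1, MPI_DOUBLE, MPI_SUM, win);
    }
    MPI_Win_flush_all(win);                /* complete all accumulates */
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Here is the output of a failing run: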
[hfujita at midway-login2 mpimbench]$ MV2_NDREG_ENTRIES=2048 mpiexec -n 2
-hosts midway-login1,midway-login2 ./random acc
[midway-login2:mpi_rank_1][error_sighandler] Caught error: Segmentation
fault (signal 11)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18343 RUNNING AT midway-login2
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at midway-login1] HYD_pmcd_pmip_control_cmd_cb
(pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:0 at midway-login1] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at midway-login1] main (pm/pmiserv/pmip.c:206): demux engine
error waiting for event
[mpiexec at midway-login2] HYDT_bscu_wait_for_completion
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at midway-login2] HYDT_bsci_wait_for_completion
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
for completion
[mpiexec at midway-login2] HYD_pmci_wait_for_completion
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
completion
[mpiexec at midway-login2] main (ui/mpich/mpiexec.c:336): process manager
error waiting for completion
Thanks,
Hajime
On 03/29/2014 10:19 AM, Hari Subramoni wrote:
> Hello Hajime,
>
> This is not a bug in the RMA design of MVAPICH2. The application is
> running out of memory that can be registered with the IB HCA. Could
> you please try running your application with the environment variable
> MV2_NDREG_ENTRIES=2048 set?
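>
> As background, here is a minimal sketch, assuming libibverbs, of the
> registration step that starts to fail once the HCA's limit on
> registered (pinned) memory is exhausted. The helper below is purely
> illustrative and is not MVAPICH2 internals.
>
> /* Illustrative only: registering an RDMA buffer with the HCA.
>  * When the limit on registered memory is reached, ibv_reg_mr()
>  * returns NULL and the transfer that needed it cannot proceed. */
> #include <infiniband/verbs.h>
> #include <stdlib.h>
>
> struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
> {
>     void *buf = malloc(len);
>     if (!buf)
>         return NULL;
>     /* Pins len bytes and gives the HCA lkey/rkey for RDMA access. */
>     struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
>                                    IBV_ACCESS_LOCAL_WRITE |
>                                    IBV_ACCESS_REMOTE_READ |
>                                    IBV_ACCESS_REMOTE_WRITE);
>     if (!mr) {
>         free(buf);   /* registration failed: limit likely exhausted */
>         return NULL;
>     }
>     return mr;
> }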
>
> Regards,
> Hari.
>
>
> On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita
> <hfujita at uchicago.edu> wrote:
>
> Dear MVAPICH team,
>
> I was glad to hear about the release of MVAPICH2-2.0rc1 and tried it
> immediately. I then found that my MPI-3 RMA program started crashing.
>
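> Below is a minimal sketch of the kind of RMA put loop the benchmark
> runs; this is a hypothetical illustration, and the actual attached
> mpimbench.c may differ in detail.
>
> /* Hypothetical sketch of the "RMA-based put" phase: rank 0 puts a
>  * buffer of the given size into a window on rank 1, with fences
>  * separating the iterations so each put is completed. */
> #include <mpi.h>
>
> void put_bench(int rank, char *buf, MPI_Win win, int bytes, int iters)
> {
>     for (int i = 0; i < iters; i++) {
>         MPI_Win_fence(0, win);
>         if (rank == 0)
>             MPI_Put(buf, bytes, MPI_CHAR, 1, 0,
>                     bytes, MPI_CHAR, win);
>         MPI_Win_fence(0, win);
>     }
> }
>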
> The attached simple program is enough to reproduce the issue. Here's
> the output:
>
> [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host
> midway-login1,midway-login2 ./mpimbench
> Message-based ping pong
> 4, 1.272331
> 8, 0.620984
> 16, 0.323668
> 32, 0.221903
> 64, 0.076136
> 128, 0.033388
> 256, 0.016455
> 512, 0.007715
> 1024, 0.004121
> 2048, 0.002435
> 4096, 0.002345
> 8192, 0.002069
> 16384, 0.002067
> 32768, 0.006494
> 65536, 0.001325
> 131072, 0.000686
> 262144, 0.000491
> 524288, 0.000423
> 1048576, 0.000395
> RMA-based put
> 16, 0.491239
> 32, 0.299855
> 64, 0.155028
> 128, 0.078400
> 256, 0.040418
> 512, 0.020406
> 1024, 0.009608
> 2048, 0.004888
> 4096, 0.002399
> 8192, 0.002702
> [midway-login1:mpi_rank_0][error_sighandler] Caught error:
> Segmentation fault (signal 11)
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 9519 RUNNING AT midway-login1
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206): demux engine
> error waiting for event
> [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
> terminated badly; aborting
> [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> waiting for completion
> [mpiexec at midway-login1] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
> for completion
> [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process
> manager error waiting for completion
>
>
> This run was done on the UChicago Midway Cluster:
> http://rcc.uchicago.edu/resources/midway_specs.html
>
> One observation is that this issue happens only when InfiniBand is
> used for communication. If I launch the same program on a single
> node, it finishes successfully.
>
> And here's the output of the mpichversion command.
> [hfujita at midway-login1 mpimbench]$ mpichversion
> MVAPICH2 Version: 2.0rc1
> MVAPICH2 Release date: Sun Mar 23 21:35:26 EDT 2014
> MVAPICH2 Device: ch3:mrail
> MVAPICH2 configure: --disable-option-checking
> --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8
> --enable-shared --disable-checkerrors --cache-file=/dev/null
> --srcdir=. CC=gcc CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib
> -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib LIBS=-libmad -libumad
> -libverbs -lrt -lhwloc -lpthread -lhwloc
> CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
> -I/include --with-cross=src/mpid/pamid/cross/bgq8
> --enable-threads=multiple
> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2 -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND
> MVAPICH2 F77: gfortran -O2
> MVAPICH2 FC: gfortran
>
> If you need more explanation or information, please let me know.
>
>
> Thanks,
> Hajime
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: random.c
Type: text/x-csrc
Size: 3104 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20140330/258284dd/attachment.bin>