[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband

Hari Subramoni subramoni.1 at osu.edu
Sat Mar 29 11:19:15 EDT 2014


Hello Hajime,

This is not a bug in the RMA design of MVAPICH2. The application is
running out of memory that can be registered with the IB HCA. Could you
please try running your application with the environment variable
MV2_NDREG_ENTRIES=2048?
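For example, reusing the launch line from your report (Hydra's mpiexec should
forward the caller's environment to the launched processes by default):

    MV2_NDREG_ENTRIES=2048 mpiexec -n 2 -host midway-login1,midway-login2 ./mpimbench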

Regards,
Hari.
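
For readers of the archive who do not have the attachment, the failing phase is
presumably a put loop along the lines of the sketch below. This is a rough
reconstruction from the output reported below (doubling message sizes; the
fence-style synchronization and buffer setup are assumptions), not the actual
attached program.

    /* Sketch of an MPI-3 RMA put benchmark of the kind described in the
     * report below: rank 0 puts messages of doubling size into a window
     * exposed by rank 1, separated by fences. Run with at least 2 ranks. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int max_size = 1 << 20;   /* 1 MiB, the largest size in the reported run */
        int rank;
        char *win_buf;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *src = calloc(max_size, 1);          /* origin buffer on every rank */
        MPI_Win_allocate(max_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                         &win_buf, &win);         /* window backing the puts */

        for (int size = 16; size <= max_size; size *= 2) {
            MPI_Win_fence(0, win);
            if (rank == 0)
                MPI_Put(src, size, MPI_BYTE, 1 /* target rank */, 0,
                        size, MPI_BYTE, win);
            MPI_Win_fence(0, win);
        }

        MPI_Win_free(&win);
        free(src);
        MPI_Finalize();
        return 0;
    }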


On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita <hfujita at uchicago.edu> wrote:

> Dear MVAPICH team,
>
> I was glad to hear about the release of MVAPICH2-2.0rc1 and immediately tried
> it. Then I found that my MPI-3 RMA program started crashing.
>
> The attached simple program is enough to reproduce the issue. Here's the
> output:
>
> [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host midway-login1,midway-login2 ./mpimbench
> Message-based ping pong
> 4, 1.272331
> 8, 0.620984
> 16, 0.323668
> 32, 0.221903
> 64, 0.076136
> 128, 0.033388
> 256, 0.016455
> 512, 0.007715
> 1024, 0.004121
> 2048, 0.002435
> 4096, 0.002345
> 8192, 0.002069
> 16384, 0.002067
> 32768, 0.006494
> 65536, 0.001325
> 131072, 0.000686
> 262144, 0.000491
> 524288, 0.000423
> 1048576, 0.000395
> RMA-based put
> 16, 0.491239
> 32, 0.299855
> 64, 0.155028
> 128, 0.078400
> 256, 0.040418
> 512, 0.020406
> 1024, 0.009608
> 2048, 0.004888
> 4096, 0.002399
> 8192, 0.002702
> [midway-login1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 9519 RUNNING AT midway-login1
> =   EXIT CODE: 11
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206): demux engine
> error waiting for event
> [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at midway-login1] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
> completion
> [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process manager
> error waiting for completion
>
>
> This run was done on the UChicago Midway Cluster.
> http://rcc.uchicago.edu/resources/midway_specs.html
>
> One observation: the issue occurs only when I use InfiniBand for
> communication. If I launch the same program on a single node, it
> finishes successfully.
>
> And here is the output of the mpichversion command.
> [hfujita at midway-login1 mpimbench]$ mpichversion
> MVAPICH2 Version:       2.0rc1
> MVAPICH2 Release date:  Sun Mar 23 21:35:26 EDT 2014
> MVAPICH2 Device:        ch3:mrail
> MVAPICH2 configure:     --disable-option-checking
> --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8 --enable-shared
> --disable-checkerrors --cache-file=/dev/null --srcdir=. CC=gcc
> CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib -Wl,-rpath,/lib -L/lib
> -Wl,-rpath,/lib LIBS=-libmad -libumad -libverbs -lrt -lhwloc -lpthread
> -lhwloc CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
> -I/include --with-cross=src/mpid/pamid/cross/bgq8
> --enable-threads=multiple
> MVAPICH2 CC:    gcc -DNDEBUG -DNVALGRIND -O2   -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND
> MVAPICH2 F77:   gfortran   -O2
> MVAPICH2 FC:    gfortran
>
> If you need more explanation or information, please let me know.
>
>
> Thanks,
> Hajime
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>

