[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband
Hajime Fujita
hfujita at uchicago.edu
Tue Apr 1 10:32:33 EDT 2014
Hi Mingzhe,
Thank you for your advice; your observation is exactly right.
However, even if I move the declaration of `local` outside the for loop
(i.e. to the beginning of the main function, or make it a global
variable), the program still crashes in the same way. Making it a heap
variable did not help, either.
Thus I suspect there's another cause...
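
For reference, the heap-based variant I tried looks roughly like this: a
minimal sketch of just the loop, based on the segment you quoted (rma_op,
msg_size, and the LCG constants are the ones from the attached random.c,
and stdlib.h is needed for malloc/free):

    int *local = malloc(msg_size);  /* origin buffer now outlives the loop */
    for (i = 0; i < N_TRY; i++) {
        rn = LCG_MUL64 * rn + LCG_ADD64;
        int tgt = (rn % (count_each * n_procs)) / count_each;
        size_t disp = rn % count_each;
        rma_op(local, msg_size, tgt, disp, win);
        if (++n % n_outstanding == 0)
            MPI_Win_flush_all(win);
    }
    MPI_Win_flush_all(win);  /* all outstanding RMA ops complete here */
    free(local);             /* freed only after the final flush */

Even with the origin buffer valid until after the final flush, the
segmentation fault over Infiniband is unchanged.
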
Thanks,
Hajime
On 03/31/2014 12:55 PM, Mingzhe Li wrote:
> Hi Hajime,
>
> Thanks for reporting. I took a look at your program. There seems to be
> an issue with the following code segment:
>
> for (i = 0; i < N_TRY; i++) {
>     int local;
>     rn = LCG_MUL64 * rn + LCG_ADD64;
>     int tgt = (rn % (count_each * n_procs)) / count_each;
>     size_t disp = rn % count_each;
>     rma_op(&local, msg_size, tgt, disp, win);
>     if (++n % n_outstanding == 0)
>         MPI_Win_flush_all(win);
> }
>
> MPI_Win_flush_all(win);
>
> Here, I saw that you use a variable "local" as the buffer for the
> put/get/acc operations. Since this variable is on the stack, it may no
> longer exist once the for loop finishes, which can be before the rma_op
> is completed by a flush operation. Could you create that variable in
> heap memory (malloc) rather than on the stack?
>
> Mingzhe
>
>
> On Sun, Mar 30, 2014 at 4:20 PM, Hajime Fujita
> <hfujita at uchicago.edu> wrote:
>
> Hi Hari,
>
> While the previous sample (mpimbench.c) worked well with
> MV2_NDREG_ENTRIES=2048, I found another example (random.c: see the
> attached file) for which the environment variable did not work.
>
> I tried MV2_NDREG_ENTRIES values up to 131072, but none of them worked. Any
> other suggestions? I really appreciate your help.
>
> This program works fine if it is
> a) run on a single host, or
> b) run with MVAPICH2-2.0b, even over Infiniband
>
>
> [hfujita at midway-login2 mpimbench]$ MV2_NDREG_ENTRIES=2048 mpiexec -n
> 2 -hosts midway-login1,midway-login2 ./random acc
> [midway-login2:mpi_rank_1][error_sighandler] Caught error:
> Segmentation fault (signal 11)
>
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 18343 RUNNING AT midway-login2
>
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at midway-login1] HYD_pmcd_pmip_control_cmd_cb
> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at midway-login1] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at midway-login1] main (pm/pmiserv/pmip.c:206): demux engine
> error waiting for event
> [mpiexec at midway-login2] HYDT_bscu_wait_for_completion
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
> terminated badly; aborting
> [mpiexec at midway-login2] HYDT_bsci_wait_for_completion
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> waiting for completion
> [mpiexec at midway-login2] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
> for completion
> [mpiexec at midway-login2] main (ui/mpich/mpiexec.c:336): process
> manager error waiting for completion
>
>
>
> Thanks,
> Hajime
>
> On 03/29/2014 10:19 AM, Hari Subramoni wrote:
>
> Hello Hajime,
>
> This is not a bug with the RMA design in MVAPICH2. The application is
> running out of memory that can be registered with the IB HCA. Can you
> please try running your application with the environment variable
> MV2_NDREG_ENTRIES=2048?
>
> Regards,
> Hari.
>
>
> On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita
> <hfujita at uchicago.edu> wrote:
>
> Dear MVAPICH team,
>
> I was glad to hear about the release of MVAPICH2-2.0rc1, and tried it
> immediately. I then found that my MPI-3 RMA program started crashing.
>
> The attached simple program is enough to reproduce the issue.
> Here's the output:
>
> [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host
> midway-login1,midway-login2 ./mpimbench
> Message-based ping pong
> 4, 1.272331
> 8, 0.620984
> 16, 0.323668
> 32, 0.221903
> 64, 0.076136
> 128, 0.033388
> 256, 0.016455
> 512, 0.007715
> 1024, 0.004121
> 2048, 0.002435
> 4096, 0.002345
> 8192, 0.002069
> 16384, 0.002067
> 32768, 0.006494
> 65536, 0.001325
> 131072, 0.000686
> 262144, 0.000491
> 524288, 0.000423
> 1048576, 0.000395
> RMA-based put
> 16, 0.491239
> 32, 0.299855
> 64, 0.155028
> 128, 0.078400
> 256, 0.040418
> 512, 0.020406
> 1024, 0.009608
> 2048, 0.004888
> 4096, 0.002399
> 8192, 0.002702
> [midway-login1:mpi_rank_0][error_sighandler] Caught error:
> Segmentation fault (signal 11)
>
>
> ===================================================================================
>
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 9519 RUNNING AT midway-login1
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
>
> [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206): demux engine
> error waiting for event
> [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
> terminated badly; aborting
> [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> waiting for completion
> [mpiexec at midway-login1] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
> for completion
> [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process
> manager error waiting for completion
>
>
> This run was done on the UChicago Midway Cluster.
> http://rcc.uchicago.edu/resources/midway_specs.html
>
> One observation is that this issue happens only when I use
> Infiniband for communication. If I launch the same program on a
> single node, it successfully finishes.
>
> And here's the output of the mpichversion command.
> [hfujita at midway-login1 mpimbench]$ mpichversion
> MVAPICH2 Version: 2.0rc1
> MVAPICH2 Release date: Sun Mar 23 21:35:26 EDT 2014
> MVAPICH2 Device: ch3:mrail
> MVAPICH2 configure: --disable-option-checking
> --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8
> --enable-shared --disable-checkerrors --cache-file=/dev/null
> --srcdir=. CC=gcc CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib
> -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib LIBS=-libmad -libumad
> -libverbs -lrt -lhwloc -lpthread -lhwloc
> CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
> -I/include --with-cross=src/mpid/pamid/cross/bgq8
> --enable-threads=multiple
> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2 -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND
> MVAPICH2 F77: gfortran -O2
> MVAPICH2 FC: gfortran
>
> If you need more explanation or information please let me know.
>
>
> Thanks,
> Hajime
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>