[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband
Mingzhe Li
li.2192 at osu.edu
Tue Apr 1 11:37:19 EDT 2014
Hi Hajime,
Thanks for your update. I tried your program with MV2-2.0rc1, but I was not
able to reproduce your issue. I will discuss this further with you off-line
and get back to mvapich-discuss with the final outcome.
Mingzhe
On Tue, Apr 1, 2014 at 10:32 AM, Hajime Fujita <hfujita at uchicago.edu> wrote:
> Hi Mingzhe,
>
> Thank you for your advice; your point is exactly right.
>
> However, even if I move the declaration of `local` outside the for loop
> (i.e., to the beginning of the main function, or make it a global
> variable), it still crashes in the same way. Making it a heap variable
> did not help, either.
>
> Thus I suspect there's another cause...
>
>
> Thanks,
> Hajime
>
>
> On 03/31/2014 12:55 PM, Mingzhe Li wrote:
>
>> Hi Hajime,
>>
>> Thanks for reporting. I took a look at your program. There seems to be
>> an issue with the following code segment:
>>
>> for (i = 0; i < N_TRY; i++) {
>>     int local;
>>     rn = LCG_MUL64 * rn + LCG_ADD64;
>>     int tgt = (rn % (count_each * n_procs)) / count_each;
>>     size_t disp = rn % count_each;
>>     rma_op(&local, msg_size, tgt, disp, win);
>>     if (++n % n_outstanding == 0)
>>         MPI_Win_flush_all(win);
>> }
>> MPI_Win_flush_all(win);
>>
>> Here, I see you pass a local variable "local" to the put/get/acc
>> operations. Since this variable is on the stack, it may no longer exist
>> when the for loop finishes, i.e., before the rma_op is completed by the
>> flush operation. Could you allocate that variable in heap memory (malloc)
>> rather than on the stack?
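>>
>> As a minimal sketch of what I mean (assuming rma_op takes the origin
>> buffer address as its first argument, and that the other variables are
>> set up as in your program), the change would look roughly like this:
>>
>>     /* Heap buffer: it stays valid until we free it, i.e. well past the
>>      * flush that completes the outstanding RMA operations. */
>>     void *local = malloc(msg_size);
>>
>>     for (i = 0; i < N_TRY; i++) {
>>         rn = LCG_MUL64 * rn + LCG_ADD64;
>>         int tgt = (rn % (count_each * n_procs)) / count_each;
>>         size_t disp = rn % count_each;
>>         rma_op(local, msg_size, tgt, disp, win);
>>         if (++n % n_outstanding == 0)
>>             MPI_Win_flush_all(win);
>>     }
>>     MPI_Win_flush_all(win);   /* complete all outstanding operations */
>>     free(local);              /* free only after the flush has returned */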
>>
>> Mingzhe
>>
>>
>> On Sun, Mar 30, 2014 at 4:20 PM, Hajime Fujita <hfujita at uchicago.edu> wrote:
>>
>> Hi Hari,
>>
>> While the previous sample (mpimbench.c) worked well with
>> MV2_NDREG_ENTRIES=2048, I found another example (random.c: see the
>> attached file) for which the environment variable did not work.
>>
>> I tried values of MV2_NDREG_ENTRIES up to 131072, but none of them
>> worked. Do you have any other suggestions? I really appreciate your help.
>>
>> This program works well if:
>> a) it is run on a single host, or
>> b) it is run with MVAPICH2-2.0b, even over InfiniBand.
>>
>>
>> [hfujita at midway-login2 mpimbench]$ MV2_NDREG_ENTRIES=2048 mpiexec -n
>> 2 -hosts midway-login1,midway-login2 ./random acc
>> [midway-login2:mpi_rank_1][error_sighandler] Caught error:
>> Segmentation fault (signal 11)
>>
>>
>> =======================================================================
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 18343 RUNNING AT midway-login2
>> = EXIT CODE: 11
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> =======================================================================
>>
>> [proxy:0:0 at midway-login1] HYD_pmcd_pmip_control_cmd_cb
>> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>> [proxy:0:0 at midway-login1] HYDT_dmxu_poll_wait_for_event
>> (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:0 at midway-login1] main (pm/pmiserv/pmip.c:206): demux engine
>> error waiting for event
>> [mpiexec at midway-login2] HYDT_bscu_wait_for_completion
>> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>> terminated badly; aborting
>> [mpiexec at midway-login2] HYDT_bsci_wait_for_completion
>> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>> waiting for completion
>> [mpiexec at midway-login2] HYD_pmci_wait_for_completion
>> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
>> for completion
>> [mpiexec at midway-login2] main (ui/mpich/mpiexec.c:336): process
>> manager error waiting for completion
>>
>>
>>
>> Thanks,
>> Hajime
>>
>> On 03/29/2014 10:19 AM, Hari Subramoni wrote:
>>
>> Hello Hajime,
>>
>> This is not a bug in the RMA design of MVAPICH2. The application is
>> running out of memory that can be registered with the IB HCA. Could you
>> please try running your application with the environment variable
>> MV2_NDREG_ENTRIES=2048?
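>>
>> For example, just prefixing your usual launch command should be enough;
>> the host names and binary below are placeholders for your own run:
>>
>>     MV2_NDREG_ENTRIES=2048 mpiexec -n 2 -hosts host1,host2 ./mpimbench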
>>
>> Regards,
>> Hari.
>>
>>
>> On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita <hfujita at uchicago.edu> wrote:
>>
>> Dear MVAPICH team,
>>
>> I was glad to hear about the release of MVAPICH2-2.0rc1 and tried it
>> immediately. I then found that my MPI-3 RMA program started crashing.
>>
>> The attached simple program is enough to reproduce the issue. Here's
>> the output:
>>
>> [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host
>> midway-login1,midway-login2 ./mpimbench
>> Message-based ping pong
>> 4, 1.272331
>> 8, 0.620984
>> 16, 0.323668
>> 32, 0.221903
>> 64, 0.076136
>> 128, 0.033388
>> 256, 0.016455
>> 512, 0.007715
>> 1024, 0.004121
>> 2048, 0.002435
>> 4096, 0.002345
>> 8192, 0.002069
>> 16384, 0.002067
>> 32768, 0.006494
>> 65536, 0.001325
>> 131072, 0.000686
>> 262144, 0.000491
>> 524288, 0.000423
>> 1048576, 0.000395
>> RMA-based put
>> 16, 0.491239
>> 32, 0.299855
>> 64, 0.155028
>> 128, 0.078400
>> 256, 0.040418
>> 512, 0.020406
>> 1024, 0.009608
>> 2048, 0.004888
>> 4096, 0.002399
>> 8192, 0.002702
>> [midway-login1:mpi_rank_0][error_sighandler] Caught error:
>> Segmentation fault (signal 11)
>>
>>
>> =======================================================================
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 9519 RUNNING AT midway-login1
>> = EXIT CODE: 11
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> =======================================================================
>>
>> [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
>> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>> [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
>> (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206): demux engine
>> error waiting for event
>> [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
>> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>> terminated badly; aborting
>> [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
>> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>> waiting for completion
>> [mpiexec at midway-login1] HYD_pmci_wait_for_completion
>> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
>> for completion
>> [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process
>> manager error waiting for completion
>>
>>
>> This run was done on the UChicago Midway Cluster:
>> http://rcc.uchicago.edu/resources/midway_specs.html
>>
>> One observation is that this issue happens only when I use InfiniBand
>> for communication. If I launch the same program on a single node, it
>> finishes successfully.
>>
>> And here's the output of the mpichversion command.
>> [hfujita at midway-login1 mpimbench]$ mpichversion
>> MVAPICH2 Version: 2.0rc1
>> MVAPICH2 Release date: Sun Mar 23 21:35:26 EDT 2014
>> MVAPICH2 Device: ch3:mrail
>> MVAPICH2 configure: --disable-option-checking
>> --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8
>> --enable-shared --disable-checkerrors --cache-file=/dev/null
>> --srcdir=. CC=gcc CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib
>> -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib LIBS=-libmad -libumad
>> -libverbs -lrt -lhwloc -lpthread -lhwloc
>> CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
>> -I/include --with-cross=src/mpid/pamid/cross/bgq8
>> --enable-threads=multiple
>> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2 -DNDEBUG
>> -DNVALGRIND -O2
>> MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND
>> MVAPICH2 F77: gfortran -O2
>> MVAPICH2 FC: gfortran
>>
>> If you need more explanation or information, please let me know.
>>
>>
>> Thanks,
>> Hajime
>>
>>
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>
>
>