[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband

Mingzhe Li li.2192 at osu.edu
Tue Apr 1 11:37:19 EDT 2014


Hi Hajime,

Thanks for your update. I tried your program with MV2-2.0rc1, but I was not
able to reproduce your issue. I will discuss this further with you off-line
and get back to mvapich-discuss with the final outcome.

Mingzhe


On Tue, Apr 1, 2014 at 10:32 AM, Hajime Fujita <hfujita at uchicago.edu> wrote:

> Hi Mingzhe,
>
> Thank you for your advice. Your advice is exactly correct.
>
> However, even if I move the declaration of `local` outside the for loop
> (i.e., to the beginning of the main function, or make it a global
> variable), it still crashes in the same way. Making it a heap variable did
> not help, either.
>
> Thus I suspect there's another cause...
>
>
> Thanks,
> Hajime
>
>
> On 03/31/2014 12:55 PM, Mingzhe Li wrote:
>
>> Hi Hajime,
>>
>> Thanks for reporting. I took a look at your program. There seems to be
>> an issue with the following code segment:
>>
>>     for (i = 0; i < N_TRY; i++) {
>>         int local;
>>         rn = LCG_MUL64 * rn + LCG_ADD64;
>>         int tgt = (rn % (count_each * n_procs)) / count_each;
>>         size_t disp = rn % count_each;
>>         rma_op(&local, msg_size, tgt, disp, win);
>>         if (++n % n_outstanding == 0)
>>             MPI_Win_flush_all(win);
>>     }
>>     MPI_Win_flush_all(win);
>>
>> Here, I see that you use the variable "local" as the local buffer for the
>> put/get/acc operations. Since this variable lives on the stack and is
>> declared inside the loop body, it goes out of scope at the end of each
>> iteration, possibly before the rma_op that references it has been
>> completed by a flush operation. Could you create that buffer in heap
>> memory (malloc) rather than on the stack?
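>>
>> For reference, here is a minimal, self-contained sketch of this pattern
>> (this is not the attached random.c; the window size, the data written, and
>> the target choice are purely illustrative): the origin buffer lives on the
>> heap and is freed only after MPI_Win_flush_all has completed the operation.
>>
>>     #include <mpi.h>
>>     #include <stdlib.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         MPI_Init(&argc, &argv);
>>
>>         int rank, nprocs;
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>
>>         /* Expose a window of 1024 ints on every process. */
>>         const int count = 1024;
>>         int *base;
>>         MPI_Win win;
>>         MPI_Win_allocate((MPI_Aint)count * sizeof(int), sizeof(int),
>>                          MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
>>         MPI_Win_lock_all(0, win);
>>
>>         /* Origin buffer on the heap: it must stay valid until the flush. */
>>         int *local = malloc(sizeof(int));
>>         *local = rank;
>>         int target = (rank + 1) % nprocs;
>>         MPI_Put(local, 1, MPI_INT, target, 0, 1, MPI_INT, win);
>>
>>         MPI_Win_flush_all(win);  /* put is complete at origin and target */
>>         free(local);             /* only safe to free after the flush    */
>>
>>         MPI_Win_unlock_all(win);
>>         MPI_Win_free(&win);
>>         MPI_Finalize();
>>         return 0;
>>     }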
>>
>> Mingzhe
>>
>>
>> On Sun, Mar 30, 2014 at 4:20 PM, Hajime Fujita <hfujita at uchicago.edu> wrote:
>>
>>     Hi Hari,
>>
>>     While the previous sample (mpimbench.c) worked well with
>>     MV2_NDREG_ENTRIES=2048, I found another example (random.c: see the
>>     attached file) for which the environment variable did not work.
>>
>>     I tried MV2_NDREG_ENTRIES values up to 131072, but none of them helped.
>>     Any other suggestions? I really appreciate your help.
>>
>>     This program works fine if it is:
>>     a) run on a single host, or
>>     b) run with MVAPICH2-2.0b, even over InfiniBand.
>>
>>
>>     [hfujita at midway-login2 mpimbench]$ MV2_NDREG_ENTRIES=2048 mpiexec -n
>>     2 -hosts midway-login1,midway-login2 ./random acc
>>     [midway-login2:mpi_rank_1][error_sighandler] Caught error:
>>     Segmentation fault (signal 11)
>>
>>
>>     ======================================================================
>>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>     =   PID 18343 RUNNING AT midway-login2
>>     =   EXIT CODE: 11
>>     =   CLEANING UP REMAINING PROCESSES
>>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>     ======================================================================
>>
>>     [proxy:0:0 at midway-login1] HYD_pmcd_pmip_control_cmd_cb
>>     (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>>     [proxy:0:0 at midway-login1] HYDT_dmxu_poll_wait_for_event
>>     (tools/demux/demux_poll.c:76): callback returned error status
>>     [proxy:0:0 at midway-login1] main (pm/pmiserv/pmip.c:206): demux engine
>>     error waiting for event
>>     [mpiexec at midway-login2] HYDT_bscu_wait_for_completion
>>     (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>>     terminated badly; aborting
>>     [mpiexec at midway-login2] HYDT_bsci_wait_for_completion
>>     (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>>     waiting for completion
>>     [mpiexec at midway-login2] HYD_pmci_wait_for_completion
>>     (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
>>     for completion
>>     [mpiexec at midway-login2] main (ui/mpich/mpiexec.c:336): process
>>     manager error waiting for completion
>>
>>
>>
>>     Thanks,
>>     Hajime
>>
>>     On 03/29/2014 10:19 AM, Hari Subramoni wrote:
>>
>>         Hello Hajime,
>>
>>         This is not a bug with the RMA design in MVAPICH2. The
>>         application is
>>         running out of memory that can be registered with the IB HCA.
>>         Can you
>>         please try running your application with the environment variable
>>         MV2_NDREG_ENTRIES=2048?
>>
>>         Regards,
>>         Hari.
>>
>>
>>         On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita
>>         <hfujita at uchicago.edu> wrote:
>>
>>              Dear MVAPICH team,
>>
>>              I was glad to hear about the release of MVAPICH2-2.0rc1 and
>>              tried it immediately. I then found that my MPI-3 RMA program
>>              started crashing.
>>
>>              The attached simple program is enough to reproduce the issue.
>>              Here's the output:
>>
>>              [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host
>>              midway-login1,midway-login2 ./mpimbench
>>              Message-based ping pong
>>              4, 1.272331
>>              8, 0.620984
>>              16, 0.323668
>>              32, 0.221903
>>              64, 0.076136
>>              128, 0.033388
>>              256, 0.016455
>>              512, 0.007715
>>              1024, 0.004121
>>              2048, 0.002435
>>              4096, 0.002345
>>              8192, 0.002069
>>              16384, 0.002067
>>              32768, 0.006494
>>              65536, 0.001325
>>              131072, 0.000686
>>              262144, 0.000491
>>              524288, 0.000423
>>              1048576, 0.000395
>>              RMA-based put
>>              16, 0.491239
>>              32, 0.299855
>>              64, 0.155028
>>              128, 0.078400
>>              256, 0.040418
>>              512, 0.020406
>>              1024, 0.009608
>>              2048, 0.004888
>>              4096, 0.002399
>>              8192, 0.002702
>>              [midway-login1:mpi_rank_0][error_sighandler] Caught error:
>>              Segmentation fault (signal 11)
>>
>>
>>              ======================================================================
>>              =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>              =   PID 9519 RUNNING AT midway-login1
>>              =   EXIT CODE: 11
>>              =   CLEANING UP REMAINING PROCESSES
>>              =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>              ======================================================================
>>
>>
>>              [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
>>              (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>>              [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
>>              (tools/demux/demux_poll.c:76): callback returned error status
>>              [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206):
>>              demux engine error waiting for event
>>              [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
>>              (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>>              terminated badly; aborting
>>              [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
>>              (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>>              waiting for completion
>>              [mpiexec at midway-login1] HYD_pmci_wait_for_completion
>>              (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error
>>              waiting for completion
>>              [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process
>>              manager error waiting for completion
>>
>>
>>              This run was done on the UChicago Midway Cluster.
>>              http://rcc.uchicago.edu/resources/midway_specs.html
>>
>>              One observation is that this issue happens only when I use
>>              InfiniBand for communication. If I launch the same program on
>>              a single node, it finishes successfully.
>>
>>              And here's the output of the mpichversion command.
>>              [hfujita at midway-login1 mpimbench]$ mpichversion
>>              MVAPICH2 Version:       2.0rc1
>>              MVAPICH2 Release date:  Sun Mar 23 21:35:26 EDT 2014
>>              MVAPICH2 Device:        ch3:mrail
>>              MVAPICH2 configure:     --disable-option-checking
>>              --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8
>>              --enable-shared --disable-checkerrors --cache-file=/dev/null
>>              --srcdir=. CC=gcc CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib
>>              -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib LIBS=-libmad -libumad
>>              -libverbs -lrt -lhwloc -lpthread -lhwloc
>>              CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
>>              -I/include --with-cross=src/mpid/pamid/cross/bgq8
>>              --enable-threads=multiple
>>              MVAPICH2 CC:    gcc -DNDEBUG -DNVALGRIND -O2   -DNDEBUG -DNVALGRIND -O2
>>              MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND
>>              MVAPICH2 F77:   gfortran   -O2
>>              MVAPICH2 FC:    gfortran
>>
>>              If you need more explanation or information, please let me know.
>>
>>
>>              Thanks,
>>              Hajime
>>
>>
>>
>>     _______________________________________________
>>     mvapich-discuss mailing list
>>     mvapich-discuss at cse.ohio-state.edu
>>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>
>
>