[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband

Mingzhe Li li.2192 at osu.edu
Mon Mar 31 13:55:48 EDT 2014


Hi Hajime,

Thanks for reporting. I took a look at your program. There seems to be an
issue with the following code segment:

    for (i = 0; i < N_TRY; i++) {
        int local;

        rn = LCG_MUL64 * rn + LCG_ADD64;
        int tgt = (rn % (count_each * n_procs)) / count_each;
        size_t disp = rn % count_each;

        rma_op(&local, msg_size, tgt, disp, win);

        if (++n % n_outstanding == 0)
            MPI_Win_flush_all(win);
    }
    MPI_Win_flush_all(win);

Here, I see that you use a stack variable "local" as the local buffer for the
put/get/acc operations. Since this variable lives on the stack, it may no
longer be valid once the loop body finishes, which can happen before the
corresponding rma_op has been completed by a flush operation. Could you
allocate that buffer on the heap (with malloc) rather than on the stack?
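
For illustration, here is a minimal sketch of that change. It reuses the names
from your loop above and assumes that rma_op takes the local buffer by pointer
and that msg_size is the buffer size in bytes:

    /* Allocate the local buffer on the heap so that it stays valid
     * until the outstanding RMA operations have been flushed. */
    char *local = malloc(msg_size);

    for (i = 0; i < N_TRY; i++) {
        rn = LCG_MUL64 * rn + LCG_ADD64;
        int tgt = (rn % (count_each * n_procs)) / count_each;
        size_t disp = rn % count_each;

        rma_op(local, msg_size, tgt, disp, win);

        if (++n % n_outstanding == 0)
            MPI_Win_flush_all(win);
    }

    MPI_Win_flush_all(win);   /* complete all pending RMA operations */
    free(local);              /* now it is safe to release the buffer */

The final MPI_Win_flush_all before free() ensures that no in-flight RMA
operation still refers to the buffer when it is released.
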
Mingzhe


On Sun, Mar 30, 2014 at 4:20 PM, Hajime Fujita <hfujita at uchicago.edu> wrote:

> Hi Hari,
>
> While the previous sample (mpimbench.c) worked well with
> MV2_NDREG_ENTRIES=2048, I found another example (random.c: see the attached
> file) for which the environment variable did not work.
>
> I tried values of MV2_NDREG_ENTRIES up to 131072, but none of them worked. Any other
> suggestions? I really appreciate your help.
>
> This program works well if
> a) run on a single host
> b) run with MVAPICH2-2.0b, even with Infiniband
>
>
> [hfujita at midway-login2 mpimbench]$ MV2_NDREG_ENTRIES=2048 mpiexec -n 2
> -hosts midway-login1,midway-login2 ./random acc
> [midway-login2:mpi_rank_1][error_sighandler] Caught error: Segmentation
> fault (signal 11)
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 18343 RUNNING AT midway-login2
> =   EXIT CODE: 11
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at midway-login1] HYD_pmcd_pmip_control_cmd_cb
> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at midway-login1] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at midway-login1] main (pm/pmiserv/pmip.c:206): demux engine
> error waiting for event
> [mpiexec at midway-login2] HYDT_bscu_wait_for_completion
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec at midway-login2] HYDT_bsci_wait_for_completion
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at midway-login2] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
> completion
> [mpiexec at midway-login2] main (ui/mpich/mpiexec.c:336): process manager
> error waiting for completion
>
>
>
> Thanks,
> Hajime
>
> On 03/29/2014 10:19 AM, Hari Subramoni wrote:
>
>> Hello Hajime,
>>
>> This is not a bug with the RMA design in MVAPICH2. The application is
>> running out of memory that can be registered with the IB HCA. Can you
>> please try running your application with the environment variable
>> MV2_NDREG_ENTRIES=2048?
>>
>> Regards,
>> Hari.
>>
>>
>> On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita <hfujita at uchicago.edu> wrote:
>>
>>     Dear MVAPICH team,
>>
>>     I was glad to hear the release of MVAPICH2-2.0rc1, and immediately
>>     tried it. Then I found that my MPI-3 RMA program started crashing.
>>
>>     The attached simple program is enough to reproduce the issue. Here's
>>     the output:
>>
>>     [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host
>>     midway-login1,midway-login2 ./mpimbench
>>     Message-based ping pong
>>     4, 1.272331
>>     8, 0.620984
>>     16, 0.323668
>>     32, 0.221903
>>     64, 0.076136
>>     128, 0.033388
>>     256, 0.016455
>>     512, 0.007715
>>     1024, 0.004121
>>     2048, 0.002435
>>     4096, 0.002345
>>     8192, 0.002069
>>     16384, 0.002067
>>     32768, 0.006494
>>     65536, 0.001325
>>     131072, 0.000686
>>     262144, 0.000491
>>     524288, 0.000423
>>     1048576, 0.000395
>>     RMA-based put
>>     16, 0.491239
>>     32, 0.299855
>>     64, 0.155028
>>     128, 0.078400
>>     256, 0.040418
>>     512, 0.020406
>>     1024, 0.009608
>>     2048, 0.004888
>>     4096, 0.002399
>>     8192, 0.002702
>>     [midway-login1:mpi_rank_0][error_sighandler] Caught error:
>>     Segmentation fault (signal 11)
>>
>>     ===================================================================================
>>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>     =   PID 9519 RUNNING AT midway-login1
>>     =   EXIT CODE: 11
>>     =   CLEANING UP REMAINING PROCESSES
>>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>     ===================================================================================
>>
>>     [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
>>     (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>>     [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
>>     (tools/demux/demux_poll.c:76): callback returned error status
>>     [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206): demux engine
>>     error waiting for event
>>     [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
>>     (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>>     terminated badly; aborting
>>     [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
>>     (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>>     waiting for completion
>>     [mpiexec at midway-login1] HYD_pmci_wait_for_completion
>>     (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
>>     for completion
>>     [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process
>>     manager error waiting for completion
>>
>>
>>     This run was done on the UChicago Midway Cluster.
>>     http://rcc.uchicago.edu/resources/midway_specs.html
>>
>>     One observation is that this issue happens only when I use
>>     Infiniband for communication. If I launch the same program on a
>>     single node, it successfully finishes.
>>
>>     And here's the output of the mpichversion command.
>>     [hfujita at midway-login1 mpimbench]$ mpichversion
>>     MVAPICH2 Version:       2.0rc1
>>     MVAPICH2 Release date:  Sun Mar 23 21:35:26 EDT 2014
>>     MVAPICH2 Device:        ch3:mrail
>>     MVAPICH2 configure:     --disable-option-checking
>>     --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8
>>     --enable-shared --disable-checkerrors --cache-file=/dev/null
>>     --srcdir=. CC=gcc CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib
>>     -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib LIBS=-libmad -libumad
>>     -libverbs -lrt -lhwloc -lpthread -lhwloc
>>     CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>>     -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
>>     -I/include --with-cross=src/mpid/pamid/cross/bgq8
>>     --enable-threads=multiple
>>     MVAPICH2 CC:    gcc -DNDEBUG -DNVALGRIND -O2   -DNDEBUG -DNVALGRIND -O2
>>     MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND
>>     MVAPICH2 F77:   gfortran   -O2
>>     MVAPICH2 FC:    gfortran
>>
>>     If you need more explanation or information please let me know.
>>
>>
>>     Thanks,
>>     Hajime
>>
>>     _______________________________________________
>>     mvapich-discuss mailing list
>>     mvapich-discuss at cse.ohio-state.edu
>>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>