[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband
Mingzhe Li
li.2192 at osu.edu
Mon Mar 31 13:55:48 EDT 2014
Hi Hajime,
Thanks for reporting. I took a look at your program. There seems to be an
issue with the following code segment:
    for (i = 0; i < N_TRY; i++) {
        int local;
        rn = LCG_MUL64 * rn + LCG_ADD64;
        int tgt = (rn % (count_each * n_procs)) / count_each;
        size_t disp = rn % count_each;
        rma_op(&local, msg_size, tgt, disp, win);
        if (++n % n_outstanding == 0)
            MPI_Win_flush_all(win);
    }
    MPI_Win_flush_all(win);
Here, I see that you use a stack variable "local" as the local buffer for the
put/get/acc operations. Since this variable is declared inside the loop, it may
no longer exist by the time the RMA operation is completed by a flush. Could
you allocate that buffer in heap memory (malloc) rather than on the stack?
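For example, a minimal sketch of that change against your snippet (the malloc
size is my assumption, adjust it to whatever your local buffer actually needs,
and include <stdlib.h> for malloc/free):

    /* Allocate the local buffer on the heap so it stays valid until the
     * outstanding RMA operations are completed by the flush. */
    void *local = malloc(msg_size);   /* assumption: msg_size bytes suffice */
    for (i = 0; i < N_TRY; i++) {
        rn = LCG_MUL64 * rn + LCG_ADD64;
        int tgt = (rn % (count_each * n_procs)) / count_each;
        size_t disp = rn % count_each;
        rma_op(local, msg_size, tgt, disp, win);
        if (++n % n_outstanding == 0)
            MPI_Win_flush_all(win);
    }
    MPI_Win_flush_all(win);
    free(local);   /* free only after the final flush has completed */
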
Mingzhe
On Sun, Mar 30, 2014 at 4:20 PM, Hajime Fujita <hfujita at uchicago.edu> wrote:
> Hi Hari,
>
> While the previous sample (mpimbench.c) worked well with
> MV2_NDREG_ENTRIES=2048, I found another example (random.c: see the attached
> file) for which the environment variable did not work.
>
> I tried values of MV2_NDREG_ENTRIES up to 131072, but none of them worked.
> Any other suggestions? I really appreciate your help.
>
> This program works well if it is
> a) run on a single host, or
> b) run with MVAPICH2-2.0b, even over InfiniBand
>
>
> [hfujita at midway-login2 mpimbench]$ MV2_NDREG_ENTRIES=2048 mpiexec -n 2
> -hosts midway-login1,midway-login2 ./random acc
> [midway-login2:mpi_rank_1][error_sighandler] Caught error: Segmentation
> fault (signal 11)
>
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 18343 RUNNING AT midway-login2
>
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at midway-login1] HYD_pmcd_pmip_control_cmd_cb
> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at midway-login1] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at midway-login1] main (pm/pmiserv/pmip.c:206): demux engine
> error waiting for event
> [mpiexec at midway-login2] HYDT_bscu_wait_for_completion
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec at midway-login2] HYDT_bsci_wait_for_completion
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at midway-login2] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
> completion
> [mpiexec at midway-login2] main (ui/mpich/mpiexec.c:336): process manager
> error waiting for completion
>
>
>
> Thanks,
> Hajime
>
> On 03/29/2014 10:19 AM, Hari Subramoni wrote:
>
>> Hello Hajime,
>>
>> This is not a bug with the RMA design in MVAPICH2. The application is
>> running out of memory that can be registered with the IB HCA. Can you
>> please try running your application with the environment variable
>> MV2_NDREG_ENTRIES=2048?
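>>
>> For example, something like this (reusing the launch command from your
>> report):
>>
>>   MV2_NDREG_ENTRIES=2048 mpiexec -n 2 -host midway-login1,midway-login2 ./mpimbench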
>>
>> Regards,
>> Hari.
>>
>>
>> On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita <hfujita at uchicago.edu
>> <mailto:hfujita at uchicago.edu>> wrote:
>>
>> Dear MVAPICH team,
>>
>> I was glad to hear about the release of MVAPICH2-2.0rc1, and immediately
>> tried it. Then I found that my MPI-3 RMA program started crashing.
>>
>> The attached simple program is enough to reproduce the issue. Here's
>> the output:
>>
>> [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host
>> midway-login1,midway-login2 ./mpimbench
>> Message-based ping pong
>> 4, 1.272331
>> 8, 0.620984
>> 16, 0.323668
>> 32, 0.221903
>> 64, 0.076136
>> 128, 0.033388
>> 256, 0.016455
>> 512, 0.007715
>> 1024, 0.004121
>> 2048, 0.002435
>> 4096, 0.002345
>> 8192, 0.002069
>> 16384, 0.002067
>> 32768, 0.006494
>> 65536, 0.001325
>> 131072, 0.000686
>> 262144, 0.000491
>> 524288, 0.000423
>> 1048576, 0.000395
>> RMA-based put
>> 16, 0.491239
>> 32, 0.299855
>> 64, 0.155028
>> 128, 0.078400
>> 256, 0.040418
>> 512, 0.020406
>> 1024, 0.009608
>> 2048, 0.004888
>> 4096, 0.002399
>> 8192, 0.002702
>> [midway-login1:mpi_rank_0][error_sighandler] Caught error:
>> Segmentation fault (signal 11)
>>
>> ===================================================================================
>>
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 9519 RUNNING AT midway-login1
>> = EXIT CODE: 11
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>>
>> [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
>> (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>> [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
>> (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206): demux engine
>> error waiting for event
>> [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
>> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>> terminated badly; aborting
>> [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
>> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>> waiting for completion
>> [mpiexec at midway-login1] HYD_pmci_wait_for_completion
>> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
>> for completion
>> [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process
>> manager error waiting for completion
>>
>>
>> This run was done on the UChicago Midway Cluster.
>> http://rcc.uchicago.edu/resources/midway_specs.html
>>
>> One observation is that this issue happens only when I use
>> Infiniband for communication. If I launch the same program on a
>> single node, it successfully finishes.
>>
>> And here's the output of the mpichversion command.
>> [hfujita at midway-login1 mpimbench]$ mpichversion
>> MVAPICH2 Version: 2.0rc1
>> MVAPICH2 Release date: Sun Mar 23 21:35:26 EDT 2014
>> MVAPICH2 Device: ch3:mrail
>> MVAPICH2 configure: --disable-option-checking
>> --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8
>> --enable-shared --disable-checkerrors --cache-file=/dev/null
>> --srcdir=. CC=gcc CFLAGS=-DNDEBUG -DNVALGRIND -O2 LDFLAGS=-L/lib
>> -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib LIBS=-libmad -libumad
>> -libverbs -lrt -lhwloc -lpthread -lhwloc
>> CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>> -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
>> -I/include --with-cross=src/mpid/pamid/cross/bgq8
>> --enable-threads=multiple
>> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2 -DNDEBUG -DNVALGRIND -O2
>> MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND
>> MVAPICH2 F77: gfortran -O2
>> MVAPICH2 FC: gfortran
>>
>> If you need more explanation or information please let me know.
>>
>>
>> Thanks,
>> Hajime
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> <mailto:mvapich-discuss at cse.ohio-state.edu>
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>