[mvapich-discuss] 2.0rc1: Crash in MPI-3 RMA program over Infiniband

Hajime Fujita hfujita at uchicago.edu
Tue Apr 1 10:32:33 EDT 2014


Hi Mingzhe,

Thank you for your advice. Your observation about the stack variable is
exactly right.

However, even if I move the declaration of `local` outside the for loop
(i.e., to the beginning of the main function, or make it a global
variable), the program still crashes in the same way. Making it a
heap-allocated variable did not help either.
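
For reference, here is roughly the heap-allocated variant I tried (only a
sketch of the loop; rma_op() and the other names are those from my test
program quoted below, and the exact malloc size and free placement here
are illustrative):

    /* Allocate the RMA local buffer on the heap so it outlives each
     * loop iteration; msg_size bytes is an assumption for this sketch. */
    char *local = malloc(msg_size);

    for (i = 0; i < N_TRY; i++) {
        rn = LCG_MUL64 * rn + LCG_ADD64;
        int tgt = (rn % (count_each * n_procs)) / count_each;
        size_t disp = rn % count_each;
        rma_op(local, msg_size, tgt, disp, win);  /* heap buffer instead of &local */
        if (++n % n_outstanding == 0)
            MPI_Win_flush_all(win);
    }
    MPI_Win_flush_all(win);  /* all outstanding RMA operations complete here */
    free(local);             /* freed only after the final flush */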

Thus I suspect there's another cause...


Thanks,
Hajime

On 03/31/2014 12:55 PM, Mingzhe Li wrote:
> Hi Hajime,
>
> Thanks for reporting. I took a look at your program. There seems to be
> an issue with the following code segment:
>
>     for (i = 0; i < N_TRY; i++) {
>         int local;
>         rn = LCG_MUL64 * rn + LCG_ADD64;
>         int tgt = (rn % (count_each * n_procs)) / count_each;
>         size_t disp = rn % count_each;
>         rma_op(&local, msg_size, tgt, disp, win);
>         if (++n % n_outstanding == 0)
>             MPI_Win_flush_all(win);
>     }
>     MPI_Win_flush_all(win);
>
> Here I see that you use the variable "local" as the local buffer for the
> put/get/acc operations. Since this variable is on the stack, it may no
> longer exist when the for loop finishes, i.e. before the rma_op is
> completed by a flush operation. Could you create that variable in heap
> memory (malloc) rather than on the stack?
>
> Mingzhe
>
>
> On Sun, Mar 30, 2014 at 4:20 PM, Hajime Fujita <hfujita at uchicago.edu> wrote:
>
>     Hi Hari,
>
>     While the previous sample (mpimbench.c) worked well with
>     MV2_NDREG_ENTRIES=2048, I found another example (random.c: see the
>     attached file) for which the environment variable did not work.
>
>     I tried values of MV2_NDREG_ENTRIES up to 131072, but none of them
>     worked. Any other suggestions? I really appreciate your help.
>
>     This program works well if
>     a) run on a single host
>     b) run with MVAPICH2-2.0b, even with Infiniband
>
>
>     [hfujita at midway-login2 mpimbench]$ MV2_NDREG_ENTRIES=2048 mpiexec -n
>     2 -hosts midway-login1,midway-login2 ./random acc
>     [midway-login2:mpi_rank_1][error_sighandler] Caught error:
>     Segmentation fault (signal 11)
>
>
>     ===================================================================================
>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>     =   PID 18343 RUNNING AT midway-login2
>
>     =   EXIT CODE: 11
>     =   CLEANING UP REMAINING PROCESSES
>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>     ===================================================================================
>     [proxy:0:0 at midway-login1] HYD_pmcd_pmip_control_cmd_cb
>     (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>     [proxy:0:0 at midway-login1] HYDT_dmxu_poll_wait_for_event
>     (tools/demux/demux_poll.c:76): callback returned error status
>     [proxy:0:0 at midway-login1] main (pm/pmiserv/pmip.c:206): demux engine
>     error waiting for event
>     [mpiexec at midway-login2] HYDT_bscu_wait_for_completion
>     (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>     terminated badly; aborting
>     [mpiexec at midway-login2] HYDT_bsci_wait_for_completion
>     (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>     waiting for completion
>     [mpiexec at midway-login2] HYD_pmci_wait_for_completion
>     (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting
>     for completion
>     [mpiexec at midway-login2] main (ui/mpich/mpiexec.c:336): process
>     manager error waiting for completion
>
>
>
>     Thanks,
>     Hajime
>
>     On 03/29/2014 10:19 AM, Hari Subramoni wrote:
>
>         Hello Hajime,
>
>         This is not a bug in the RMA design of MVAPICH2. The application
>         is running out of memory that can be registered with the IB HCA.
>         Could you please try running your application with the
>         environment variable MV2_NDREG_ENTRIES=2048?
>
>         Regards,
>         Hari.
>
>
>         On Tue, Mar 25, 2014 at 2:35 PM, Hajime Fujita
>         <hfujita at uchicago.edu> wrote:
>
>              Dear MVAPICH team,
>
>              I was glad to hear about the release of MVAPICH2-2.0rc1 and
>              tried it immediately. I then found that my MPI-3 RMA program
>              started crashing.
>
>              The attached simple program is enough to reproduce the
>         issue. Here's
>              the output:
>
>              [hfujita at midway-login1 mpimbench]$ mpiexec -n 2 -host
>              midway-login1,midway-login2 ./mpimbench
>              Message-based ping pong
>              4, 1.272331
>              8, 0.620984
>              16, 0.323668
>              32, 0.221903
>              64, 0.076136
>              128, 0.033388
>              256, 0.016455
>              512, 0.007715
>              1024, 0.004121
>              2048, 0.002435
>              4096, 0.002345
>              8192, 0.002069
>              16384, 0.002067
>              32768, 0.006494
>              65536, 0.001325
>              131072, 0.000686
>              262144, 0.000491
>              524288, 0.000423
>              1048576, 0.000395
>              RMA-based put
>              16, 0.491239
>              32, 0.299855
>              64, 0.155028
>              128, 0.078400
>              256, 0.040418
>              512, 0.020406
>              1024, 0.009608
>              2048, 0.004888
>              4096, 0.002399
>              8192, 0.002702
>              [midway-login1:mpi_rank_0][error_sighandler] Caught error:
>              Segmentation fault (signal 11)
>
>
>              ===================================================================================
>              =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>              =   PID 9519 RUNNING AT midway-login1
>              =   EXIT CODE: 11
>              =   CLEANING UP REMAINING PROCESSES
>              =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
>              ===================================================================================
>              [proxy:0:1 at midway-login2] HYD_pmcd_pmip_control_cmd_cb
>              (pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>              [proxy:0:1 at midway-login2] HYDT_dmxu_poll_wait_for_event
>              (tools/demux/demux_poll.c:76): callback returned error status
>              [proxy:0:1 at midway-login2] main (pm/pmiserv/pmip.c:206): demux
>              engine error waiting for event
>              [mpiexec at midway-login1] HYDT_bscu_wait_for_completion
>              (tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>              terminated badly; aborting
>              [mpiexec at midway-login1] HYDT_bsci_wait_for_completion
>              (tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>              waiting for completion
>              [mpiexec at midway-login1] HYD_pmci_wait_for_completion
>              (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error
>              waiting for completion
>              [mpiexec at midway-login1] main (ui/mpich/mpiexec.c:336): process
>              manager error waiting for completion
>
>
>              This run was done on the UChicago Midway Cluster.
>              http://rcc.uchicago.edu/resources/midway_specs.html
>
>              One observation is that this issue happens only when I use
>              Infiniband for communication. If I launch the same program on a
>              single node, it successfully finishes.
>
>              And here's the output of the mpichversion command.
>              [hfujita at midway-login1 mpimbench]$ mpichversion
>              MVAPICH2 Version:       2.0rc1
>              MVAPICH2 Release date:  Sun Mar 23 21:35:26 EDT 2014
>              MVAPICH2 Device:        ch3:mrail
>              MVAPICH2 configure:     --disable-option-checking
>              --prefix=/project/aachien/local/mvapich2-2.0rc1-gcc-4.8
>              --enable-shared --disable-checkerrors --cache-file=/dev/null
>              --srcdir=. CC=gcc CFLAGS=-DNDEBUG -DNVALGRIND -O2
>         LDFLAGS=-L/lib
>              -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib LIBS=-libmad -libumad
>              -libverbs -lrt -lhwloc -lpthread -lhwloc
>              CPPFLAGS=-I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/include
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/common/include
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/ch3/channels/mrail/src/gen2
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpid/common/locks
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/util/wrappers
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpl/include
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/openpa/src
>              -I/project/aachien/local/src/mvapich2-2.0rc1-gcc-4.8/src/mpi/romio/include
>              -I/include --with-cross=src/mpid/pamid/cross/bgq8
>              --enable-threads=multiple
>              MVAPICH2 CC:    gcc -DNDEBUG -DNVALGRIND -O2   -DNDEBUG
>         -DNVALGRIND -O2
>              MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND
>              MVAPICH2 F77:   gfortran   -O2
>              MVAPICH2 FC:    gfortran
>
>              If you need more explanation or information please let me know.
>
>
>              Thanks,
>              Hajime
>
>
>
>
>
>     _______________________________________________
>     mvapich-discuss mailing list
>     mvapich-discuss at cse.ohio-state.edu
>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



