[mvapich-discuss] segmentation fault in MPI_Win_fence with #PE = 96

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Aug 12 14:53:07 EDT 2009


> Thanks for the note. The error is not present with the trunk version!

Thanks for confirming that it works with the trunk version. We also
verified it.

FYI, for the RC1 version, it will also work if you increase the
on-demand threshold to the number of processes (96 in your case). This
issue has been fixed in the trunk, and the fix will be available in the
RC2 release.
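
For example, with the mpirun_rsh launcher you can pass the
MV2_ON_DEMAND_THRESHOLD run-time parameter on the command line along
these lines (the hostfile and application names below are placeholders;
with other launchers, export the variable in the job environment
instead):

    mpirun_rsh -np 96 -hostfile hosts MV2_ON_DEMAND_THRESHOLD=96 ./your_app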

> configure.in:73: the top level
> autom4te: /usr/bin/m4 failed with exit status: 1
>
> I solved it by reverting to autoconf 2.63 ...
>
> Should I post this to the mpich2 mailing list or is there a difference
> between the mpich2 and mvapich2 configure scripts?

Thanks for pointing this out. Pavan has created a Trac entry to resolve
it.

Since mvapich2 releases are typically based on previous mpich2 releases
(for example, mvapich2 1.4 is based on mpich2 1.0.8, not 1.1.1), you may
see some differences in the mpich2 and mvapich2 configure scripts.

Thanks,

DK


> Thanks,
> Dorian
>
>
> Dhabaleswar Panda wrote:
> > Dorian,
> >
> > Thanks for your report. Do you see this error with the latest trunk
> > version of MVAPICH2 1.4? After the RC1 release, some fixes have gone into
> > the trunk. We are preparing to bring out RC2.
> >
> > We will also take a look at this issue in the meantime.
> >
> > Thanks,
> >
> > DK
> >
> > On Tue, 11 Aug 2009, Dorian Krause wrote:
> >
> >
> >> Dear list members,
> >>
> >> I have a code which uses MPI_Put + MPI_Win_fence for communication. The
> >> code runs fine with OpenMPI (tested with 8, 16, 32, 48, 64 and 96
> >> processors without problems) and with mvapich2 for fewer than 96
> >> processors (the maximum number I currently have access to). The core
> >> file I got shows the following:
> >>
> >> #0  Post_Put_Put_Get_List (winptr=0x6e06a0, size=-1, dreg_tmp=<value
> >> optimized out>, vc_ptr=0x10013c60, local_buf=0x7ffffac96e10,
> >> remote_buf=0x7ffffac96e08, length=4, lkeys=0x7ffffac96e1c,
> >>     rkeys=0x7ffffac96e18, use_multi=0) at rdma_iba_1sc.c:1137
> >> 1137            ++(vc_ptr->mrail.rails[rail].postsend_times_1sc);
> >> (gdb) p rail
> >> No symbol "rail" in current context.
> >> (gdb) p vc_ptr
> >> $1 = (MPIDI_VC_t *) 0x10013c60
> >> Current language:  auto; currently c
> >> (gdb) p vc_ptr->mrail
> >> $2 = {num_rails = 1, rails = 0x0, next_packet_expected = 0,
> >> next_packet_tosend = 0, outstanding_eager_vbufs = 0, coalesce_vbuf =
> >> 0x0, rfp = {RDMA_send_buf_DMA = 0x0, RDMA_recv_buf_DMA = 0x0,
> >>     RDMA_send_buf = 0x0, RDMA_recv_buf = 0x0, RDMA_send_buf_mr = {0x0,
> >> 0x0, 0x0, 0x0}, RDMA_recv_buf_mr = {0x0, 0x0, 0x0, 0x0},
> >> RDMA_remote_buf_rkey = {0, 0, 0, 0}, rdma_credit = 0 '\0',
> >>     remote_RDMA_buf = 0x0, phead_RDMA_send = 0, ptail_RDMA_send = 0,
> >> p_RDMA_recv = 0, p_RDMA_recv_tail = 0, eager_start_cnt = 0,
> >> in_polling_set = 0, cached_outgoing = 0x0, cached_incoming = 0x0,
> >>     cached_hit = 0, cached_miss = 0}, srp = {credits = 0x0}, cmanager =
> >> {num_channels = 0, num_local_pollings = 0, msg_channels = 0x0,
> >> next_arriving = 0x0, inqueue = 0, prev = 0x0, next = 0x0,
> >>     pending_vbuf = 0, vc = 0x0}, packetized_recv = 0x0, sreq_head = 0x0,
> >> sreq_tail = 0x0, nextflow = 0x0, inflow = 0, remote_vc_addr = 0}
> >> (gdb) p vc_ptr->mrail.rails
> >> $3 = (struct mrail_rail *) 0x0
> >> (gdb) bt
> >> #0  Post_Put_Put_Get_List (winptr=0x6e06a0, size=-1, dreg_tmp=<value
> >> optimized out>, vc_ptr=0x10013c60, local_buf=0x7ffffac96e10,
> >> remote_buf=0x7ffffac96e08, length=4, lkeys=0x7ffffac96e1c,
> >>     rkeys=0x7ffffac96e18, use_multi=0) at rdma_iba_1sc.c:1137
> >> #1  0x000000000044a09a in MPIDI_CH3I_RDMA_post (win_ptr=0x6e06a0,
> >> target_rank=0) at rdma_iba_1sc.c:476
> >> #2  0x000000000045f434 in MPIDI_Win_fence (assert=12288,
> >> win_ptr=0x6e06a0) at ch3u_rma_sync.c:165
> >> #3  0x000000000041fecd in PMPI_Win_fence (assert=12288, win=-1610612736)
> >> at win_fence.c:108
> >> #4  0x0000000000409dfc in hgc::OscPt2PtCommunicationGraph::sendP2M
> >> (this=0x10806650, list=@0x10278fe0) at comm/Window.hh:81
> >> #5  0x0000000000404a5d in main (argc=2, argv=0x7ffffac97398) at
> >> Scale4Bonn/scale.cc:129
> >>
> >>
> >> Obviously vc_ptr->mrail.rails is NULL. Can you help me understand why?
> >>
> >> The relevant code snippet is
> >>
> >>         mWindow.fence(MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE);
> >>         for(int k = 0; k < mTop.numprocs(); ++k) {
> >>                 if(1 == mMustResend[k]) {
> >>                         mWindow.put(&mSendBuf[k], 1, MPI_INT, k,
> >>                                 mLocalGroup.myrank(), 1, MPI_INT);
> >>                 }
> >>         }
> >>         mWindow.fence(MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED |
> >>                 MPI_MODE_NOPUT);
> >>
> >> and on the receiver side I just have
> >>
> >>         mWindow.fence(MPI_MODE_NOSTORE | MPI_MODE_NOPRECEDE);
> >>         mWindow.fence(MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED |
> >>                 MPI_MODE_NOPUT);
> >>
> >> mWindow is an instance of a wrapper class around an MPI_Win; the put
> >> and fence functions map directly to MPI_Put and MPI_Win_fence ...
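> >>
> >> For reference, in plain MPI terms the pattern boils down to roughly the
> >> following self-contained sketch (this is not the actual wrapper code;
> >> the buffer names, the window layout and the put-to-every-peer loop are
> >> made up for illustration):
> >>
> >> #include <mpi.h>
> >> #include <stdlib.h>
> >>
> >> int main(int argc, char **argv)
> >> {
> >>     MPI_Init(&argc, &argv);
> >>
> >>     int nprocs, myrank;
> >>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
> >>     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
> >>
> >>     /* one int slot per peer in the window, one int per peer to send */
> >>     int *winbuf  = calloc(nprocs, sizeof(int));
> >>     int *sendbuf = calloc(nprocs, sizeof(int));
> >>
> >>     MPI_Win win;
> >>     MPI_Win_create(winbuf, nprocs * sizeof(int), sizeof(int),
> >>                    MPI_INFO_NULL, MPI_COMM_WORLD, &win);
> >>
> >>     /* opening fence */
> >>     MPI_Win_fence(MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE, win);
> >>
> >>     /* each rank writes one int into slot 'myrank' of every peer */
> >>     for (int k = 0; k < nprocs; ++k)
> >>         MPI_Put(&sendbuf[k], 1, MPI_INT, k, myrank, 1, MPI_INT, win);
> >>
> >>     /* closing fence */
> >>     MPI_Win_fence(MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED | MPI_MODE_NOPUT,
> >>                   win);
> >>
> >>     MPI_Win_free(&win);
> >>     free(winbuf);
> >>     free(sendbuf);
> >>     MPI_Finalize();
> >>     return 0;
> >> }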
> >>
> >> For this test I used mvapich2 1.4 rc1 configured with
> >>
> >> ./configure --prefix=/home/kraused/mvapich2/1.4rc1/gcc-4.1.2/ \
> >>     CFLAGS="-O0 -ggdb" CXXFLAGS=-ggdb FCFLAGS=-ggdb
> >>
> >> Thanks for your help!
> >>
> >> Regards,
> >> Dorian
> >>
> >>
> >>
> >
> >
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


