[mvapich-discuss] mvapich2/2.3.2 failing with internal segfault on ch3:mrail build

Subramoni, Hari subramoni.1 at osu.edu
Wed Sep 9 07:22:02 EDT 2020


Hi, Chris.

Sorry to hear that you are facing issues with MVAPICH2.

Can you please try setting MV2_NDREG_ENTRIES=16384 and MV2_NDREG_ENTRIES_MAX=16384 and see if that works around the issue for you?
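
For example, assuming a Torque/PBS job and the mpirun_rsh launcher (the process count and the ./enzo.exe binary name below are placeholders; please adjust for your site):

  $ mpirun_rsh -np 64 -hostfile $PBS_NODEFILE MV2_NDREG_ENTRIES=16384 MV2_NDREG_ENTRIES_MAX=16384 ./enzo.exe <args>

With mpiexec or another launcher you can instead export the variables in the job environment before launching:

  $ export MV2_NDREG_ENTRIES=16384
  $ export MV2_NDREG_ENTRIES_MAX=16384
  $ mpiexec -np 64 ./enzo.exe <args>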

In the meantime, could you provide us with the details of how to reproduce the issue locally (how to download, build and run the application) so that we can see what could be going on?
Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Stone, Christopher P
Sent: Tuesday, September 8, 2020 9:40 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] mvapich2/2.3.2 failing with internal segfault on ch3:mrail build

Good morning,

A user at PACE is encountering a run-time failure with an application (Enzo), and I have reproduced the issue multiple times. The application fails at random points during certain MPI calls. Stack traces show that every failure occurs inside the MPIDI_CH3_Rendezvouz_r3_recv_data function in our mvapich2/2.3.2 build. (I reproduced it with 2.3.4 as well.)

I rebuilt mvapich2/2.3.2 with the same configuration plus debug symbols (--enable-g=debug), still building with --enable-fast=all.

The stack trace from the failing process's core shows:

>>>

(gdb) where
#0  MPIDI_CH3_Rendezvouz_r3_recv_data (vc=vc@entry=0x5349fb8, buffer=buffer@entry=0x2aaac36e1170)
    at src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c:959
#1  0x00002aaaac427da5 in handle_read_individual (header_type=<synthetic pointer>, buffer=0x2aaac36e1170, vc=0x5349fb8)
    at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:1488
#2  handle_read (vc=0x5349fb8, buffer=0x2aaac36e1170) at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:1347
#3  0x00002aaaac428822 in MPIDI_CH3I_Progress (is_blocking=is_blocking@entry=1, state=state@entry=0x7fffffd4c80c)
    at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:282
#4  0x00002aaaac3bb92d in MPIC_Wait (request_ptr=0x9e001e0, errflag=errflag@entry=0x7fffffd4cb3c) at src/mpi/coll/helper_fns.c:296
#5  0x00002aaaac3bbfbd in MPIC_Sendrecv (sendbuf=sendbuf@entry=0x8978d88, sendcount=sendcount@entry=1, sendtype=sendtype@entry=1275070473,
    dest=dest@entry=70, sendtag=sendtag@entry=7, recvbuf=recvbuf@entry=0x8978d80, recvcount=74, recvtype=1275070473, source=70, recvtag=7,
    comm_ptr=0x2aaaac7be980 <MPID_Comm_direct+2240>, status=0x7fffffd4c940, errflag=0x7fffffd4cb3c) at src/mpi/coll/helper_fns.c:558
#6  0x00002aaaac058977 in MPIR_Allgather_RD_MV2 (sendbuf=sendbuf@entry=0x7fffffd4cbb8, sendcount=sendcount@entry=1,
    sendtype=sendtype@entry=1275070473, recvbuf=recvbuf@entry=0x8978b50, recvcount=recvcount@entry=1, recvtype=recvtype@entry=1275070473,
    comm_ptr=0x2aaaac7be980 <MPID_Comm_direct+2240>, errflag=0x7fffffd4cb3c) at src/mpi/coll/allgather_osu.c:570
#7  0x00002aaaac05c691 in MPIR_Allgather_index_tuned_intra_MV2 (sendbuf=sendbuf@entry=0x7fffffd4cbb8, sendcount=sendcount@entry=1,
    sendtype=1275070473, recvbuf=recvbuf@entry=0x111647b0, recvcount=1, recvtype=1275070473, comm_ptr=0x2aaaac7bf240 <MPID_Comm_builtin>,
    errflag=0x7fffffd4cb3c) at src/mpi/coll/allgather_osu.c:2461
#8  0x00002aaaac05cfd3 in MPIR_Allgather_MV2 (sendbuf=0x7fffffd4cbb8, sendcount=1, sendtype=<optimized out>, recvbuf=0x111647b0, recvcount=1,
    recvtype=1275070473, comm_ptr=0x2aaaac7bf240 <MPID_Comm_builtin>, errflag=0x7fffffd4cb3c) at src/mpi/coll/allgather_osu.c:2582
#9  0x00002aaaac025169 in MPIR_Allgather_impl (sendbuf=sendbuf@entry=0x7fffffd4cbb8, sendcount=sendcount@entry=1, sendtype=sendtype@entry=1275070473,
    recvbuf=recvbuf@entry=0x111647b0, recvcount=recvcount@entry=1, recvtype=recvtype@entry=1275070473, comm_ptr=0x2aaaac7bf240 <MPID_Comm_builtin>,
    errflag=0x7fffffd4cb3c) at src/mpi/coll/allgather.c:845
#10 0x00002aaaac0259e2 in PMPI_Allgather () at src/mpi/coll/allgather.c:997
#11 0x00000000004706c9 in CommunicationShareGrids(HierarchyEntry**, long long, long long) () at CommunicationShareGrids.C:137
#12 0x000000000066302d in RebuildHierarchy(TopGridData*, LevelHierarchyEntry**, long long) () at RebuildHierarchy.C:454
#13 0x00000000004a3a77 in EvolveHierarchy(HierarchyEntry&, TopGridData&, ExternalBoundary*, ImplicitProblemABC*, LevelHierarchyEntry**, double) ()
    at EvolveHierarchy.C:584
#14 0x00000000004195cb in main () at enzo.C:793
#15 0x00002aaaad0c93d5 in __libc_start_main () from /lib64/libc.so.6
#16 0x000000000042fbbf in _start () at enzo.C:932
(gdb) l 950
945 #undef FCNAME
946 #define FCNAME MPL_QUOTE(FUNCNAME)
947 int MPIDI_CH3_Rendezvouz_r3_recv_data(MPIDI_VC_t * vc, vbuf * buffer)
948 {
949     int mpi_errno = MPI_SUCCESS;
950     int skipsize = sizeof(MPIDI_CH3_Pkt_rndv_r3_data_t);
951     int nb, complete;
952     MPID_Request *rreq;
953     MPIDI_STATE_DECL(MPID_STATE_MPIDI_CH3I_RNDV_R3_RCV_DATA);
954     MPIDI_FUNC_ENTER(MPID_STATE_MPIDI_CH3I_RNDV_R3_RCV_DATA);
(gdb)
955     MPID_Request_get_ptr(((MPIDI_CH3_Pkt_rndv_r3_data_t *) (buffer->
956                                                             pheader))->
957                         receiver_req_id, rreq);
958
959     if (!(MV2_RNDV_PROTOCOL_R3 == rreq->mrail.protocol ||
960           MV2_RNDV_PROTOCOL_RPUT == rreq->mrail.protocol)) {
961         int rank;
962         UPMI_GET_RANK(&rank);
963
964         DEBUG_PRINT( "[rank %d]get wrong req protocol, req %08x, protocol %d\n", rank,
(gdb) p rreq
$12 = (MPID_Request *) 0x0
(gdb) p *(MPIDI_CH3_Pkt_rndv_r3_data_t *) (buffer->pheader)
$13 = {type = 22 '\026', seqnum = 5171, acknum = 5173, remote_credit = 4 '\004', rdma_credit = 0 '\000', src = {smp_index = 87334840,
    rank = 87334840, vc_addr = 87334840}, vbuf_credit = 0 '\000', rail = 0 '\000', receiver_req_id = 0,
  send_req_id = 0x2aaab047656b <mlx5_poll_cq_1+1515>, csend_req_id = 0x0}
(gdb)

<<<

The segfault is due to the NULL rreq pointer (the packet dump above shows receiver_req_id = 0), but I was not able to deduce why MPID_Request_get_ptr produces a NULL request here.
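
To make the mechanics concrete, below is a minimal self-contained sketch of what the core dump suggests (simplified stand-in types and names; this is not MVAPICH2 code): the incoming R3 packet carries receiver_req_id == 0, the handle-to-pointer lookup therefore yields NULL, and the unguarded protocol check at ch3_rndvtransfer.c:959 is the first dereference of that NULL pointer.

#include <stdio.h>

typedef struct { int protocol; } request_t;                /* stand-in for MPID_Request       */
typedef struct { unsigned receiver_req_id; } r3_pkt_t;     /* stand-in for the R3 data packet */

static request_t request_table[4];                         /* toy request table               */

/* stand-in for MPID_Request_get_ptr(): handle 0 resolves to no request at all */
static request_t *get_request(unsigned handle)
{
    if (handle == 0 || handle > 4)
        return NULL;
    return &request_table[handle - 1];
}

int main(void)
{
    r3_pkt_t pkt = { 0 };                                   /* receiver_req_id == 0, as in the dump */
    request_t *rreq = get_request(pkt.receiver_req_id);

    /* The real code reads rreq->mrail.protocol immediately; with rreq == NULL
     * that read is the SIGSEGV.  A guard would fail loudly instead. */
    if (rreq == NULL) {
        fprintf(stderr, "NULL rreq for receiver_req_id %u\n", pkt.receiver_req_id);
        return 1;
    }
    printf("protocol = %d\n", rreq->protocol);
    return 0;
}

The open question, of course, is why the packet arrived with a zero receiver_req_id in the first place.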

Here are my build options:

>>>

[cs199@login-hive1 mvapich2-2.3.2]$ mpiname -a
MVAPICH2 2.3.2 Fri August 9 22:00:00 EST 2019 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -g -O2
CXX: g++   -DNDEBUG -DNVALGRIND -g -O2
F77: mpif77 -L/lib -L/lib   -g -O2
FC: gfortran   -g -O2

Configuration
--prefix=<hidden>/builds/mvapich2/2.3.2 --enable-shared --enable-romio --disable-silent-rules --disable-new-dtags --enable-fortran=all --enable-threads=multiple --with-ch3-rank-bits=32 --enable-wrapper-rpath=yes --disable-alloca --enable-fast=all --disable-cuda --enable-registration-cache --with-pbs=/opt/torque/current --with-device=ch3:mrail --with-rdma=gen2 --disable-mcast --with-file-system=nfs+ufs --enable-g=debug CC=gcc CXX=g++ FC=gfortran

<<<


Does this failure indicate a problem with our mvapich2 build/configuration? Is there a workaround we can try, either on the build side or in the application?

Thanks for any assistance or guidance.

Chris Stone



Christopher Stone, PhD
Software and Collaboration Support (SCS) Team
Partnership for an Advanced Computing Environment (PACE)
Georgia Institute of Technology