[mvapich-discuss] Segfault seen during connect/accept with dynamic processes

Neil Spruit nrspruit at gmail.com
Tue Feb 24 14:46:09 EST 2015


Hello,

I have been working on a Host/Client setup with MVAPICH2 and have run into
an issue that comes up when performing connect/accept between two MPI
processes, specifically when using MPI_THREAD_MULTIPLE.
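
For reference, the connect/accept pattern I am using looks roughly like the
sketch below (the server/client split, the way the port name is exchanged,
and the lack of error checking are simplifications of my own, not the actual
application code):

/* Minimal sketch of the host/client pattern with MPI_THREAD_MULTIPLE. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        /* remote side: open a port and wait for the host to connect */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);  /* handed to the client out of band */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    } else {
        /* host side: argv[1] is assumed to be the port name printed above */
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}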

The scenario: one process on my Host system connects to an MPI process
launched separately on a remote machine. Sometimes this connection
succeeds and the connection is made; however, some of the time a
segmentation fault occurs inside MVAPICH during the connection.
Specifically, the error is tied to *vc->state* changing from active to
inactive, which sends the code down a path that tries to free *vc->pg*
even though it is NULL; there is no check to prevent this attempt to free
a NULL pointer, hence the segfault.

The issue occurs in src/mpid/ch3/src/mpid_vc.c, lines 275-292:

if (vc->state == MPIDI_VC_STATE_ACTIVE ||
    vc->state == MPIDI_VC_STATE_LOCAL_ACTIVE ||
    vc->state == MPIDI_VC_STATE_REMOTE_CLOSE)
#endif
{
#ifdef _ENABLE_XRC_
    PRINT_DEBUG(DEBUG_XRC_verbose>0, "SendClose2 %d 0x%08x %d\n",
                vc->pg_rank, vc->ch.xrc_flags, vc->ch.state);
#endif
    MPIDI_CH3U_VC_SendClose( vc, i );
}
else
{
    MPIDI_PG_release_ref(vc->pg, &inuse);
    if (inuse == 0)
    {
        MPIDI_PG_Destroy(vc->pg);
    }

In the above section of code, vc->state in my program goes from
MPIDI_VC_STATE_ACTIVE to MPIDI_VC_STATE_INACTIVE. This causes the if check
to fall through to the else branch, where MPIDI_PG_release_ref attempts to
use the vc->pg variable, which in my case is always NULL (I noticed that
most code paths set pg to NULL, with only a few exceptions).

This does not happen every time; I see it on roughly 1 in 10 connection
calls. It also only fails in the MPI_Comm_connect call, with the backtrace
below.

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f53a2336700 (LWP 5156)]

0x00007f53a58c67c7 in MPID_VCRT_Release () from
/usr/local/lib/libmpich.so.12

(gdb) bt

#0  0x00007f53a58c67c7 in MPID_VCRT_Release () from
/usr/local/lib/libmpich.so.12

#1  0x00007f53a5884103 in MPIR_Comm_delete_internal () from
/usr/local/lib/libmpich.so.12

#2  0x00007f53a58a5e0c in MPIDI_Comm_connect () from
/usr/local/lib/libmpich.so.12

#3  0x00007f53a58c20f3 in MPID_Comm_connect () from
/usr/local/lib/libmpich.so.12

#4  0x00007f53a5b52956 in PMPI_Comm_connect () from
/usr/local/lib/libmpich.so.12

#5  0x00007f53a6564255 in _MPIComm::Connect (this=0x7f5398004010,

    address=<value optimized out>, port=<value optimized out>


As a quick local fix, I added a check that vc->pg is not NULL before
attempting to use it. The segfault went away, and I have not seen any
issues or errors in the operation of the MPI code since that change;
however, that fixes a symptom, not the root of the issue.
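
For reference, the local change is essentially the guard sketched below in
the else branch quoted above (this is just my workaround, not a proposed
upstream patch):

else
{
    /* Workaround sketch: only touch the process group if one is actually
     * attached to this VC; in my runs vc->pg is NULL here, which is what
     * triggers the segfault. */
    if (vc->pg != NULL)
    {
        MPIDI_PG_release_ref(vc->pg, &inuse);
        if (inuse == 0)
        {
            MPIDI_PG_Destroy(vc->pg);
        }
    }
}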


The question I have is: why would this connection go from active to
inactive and end up in this state?


In all cases I see this sequence of events (from prints I added to the code):


I just set a NULL process group for vc 0x7f7920000940

setting vc 0x7f7920000940 active

vc 0x7f7920000940 vc->state is 1

vc 0x7f7920000940 vc->state is 1

NULL process group


This implies that the state was indeed set to active, but somewhere else in
the code it transitioned to inactive before the state check in this code
path.
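
The prints themselves are just fprintf calls I dropped into mpid_vc.c near
the places that set vc->pg and vc->state, roughly like this (my own
instrumentation, placement approximate):

/* where the VC is created with no process group attached */
fprintf(stderr, "I just set a NULL process group for vc %p\n", (void *) vc);

/* where the VC is marked active */
vc->state = MPIDI_VC_STATE_ACTIVE;
fprintf(stderr, "setting vc %p active\n", (void *) vc);

/* just before the state check quoted above */
fprintf(stderr, "vc %p vc->state is %d\n", (void *) vc, vc->state);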


I am working on a reliable reproducer, but it has been tricky since it is
most definitely a timing issue.
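
What I have been trying is basically the connecting side run in a tight
loop, since hammering connect/disconnect seems to be the best way to hit
the window; a sketch under that assumption, with the iteration count and
port handling as placeholders:

/* Hypothetical stress loop for the connecting side; with the default
 * error handler a failed connect aborts, so the MPI_SUCCESS check only
 * matters if MPI_ERRORS_RETURN is set on MPI_COMM_SELF. */
for (int i = 0; i < 100; i++) {
    MPI_Comm inter;
    if (MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter)
            == MPI_SUCCESS) {
        MPI_Comm_disconnect(&inter);
    }
}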


Please let me know what you can find out about this situation. I am using
mvapich2-2.0.1, but I have also seen this issue in mvapich2-2.1rc1.


Thank you very much for your time.


Respectfully,

Neil Spruit