[mvapich-discuss] Segfault seen during connect/accept with dynamic processes

Neil Spruit nrspruit at gmail.com
Tue Mar 3 17:04:41 EST 2015


Hello Hari,

Sorry for not getting back sooner. This is still an issue I need solved,
but it has been difficult to produce a reliable reproducer because it is
mostly a timing condition. I am close, but it is not quite reliable enough
yet. I will send it out as soon as it is ready so you will be able to
verify the fix.
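In the meantime, here is a rough, hypothetical skeleton of the shape the
reproducer will take (this is not the actual reproducer; the loop count, the
command-line handling, and the single-threaded structure are simplifications
of my threaded host/client setup). One copy runs as the server and accepts
repeatedly; the other connects and disconnects in a loop, which is roughly
where the timing window shows up:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, i;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    /* the real setup uses MPI_THREAD_MULTIPLE, so request it here too */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);
        fflush(stdout);
        for (i = 0; i < 100; i++) {   /* accept/disconnect repeatedly */
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            MPI_Comm_disconnect(&inter);
        }
        MPI_Close_port(port);
    } else if (argc > 1) {            /* client: argv[1] is the port name */
        strncpy(port, argv[1], MPI_MAX_PORT_NAME - 1);
        port[MPI_MAX_PORT_NAME - 1] = '\0';
        for (i = 0; i < 100; i++) {   /* connect/disconnect repeatedly */
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            MPI_Comm_disconnect(&inter);
        }
    }

    MPI_Finalize();
    return 0;
}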

Just to confirm, the patch that resolves the error I see would be the
following: changing the code in /src/mpid/ch3/src/mpid_vc.c, lines 275-292,
from this:

if (vc->state == MPIDI_VC_STATE_ACTIVE ||
    vc->state == MPIDI_VC_STATE_LOCAL_ACTIVE ||
    vc->state == MPIDI_VC_STATE_REMOTE_CLOSE)
#endif
{
#ifdef _ENABLE_XRC_
    PRINT_DEBUG(DEBUG_XRC_verbose>0, "SendClose2 %d 0x%08x %d\n",
                vc->pg_rank, vc->ch.xrc_flags, vc->ch.state);
#endif
    MPIDI_CH3U_VC_SendClose( vc, i );
}
else
{
    MPIDI_PG_release_ref(vc->pg, &inuse);
    if (inuse == 0)
    {
        MPIDI_PG_Destroy(vc->pg);
    }

*To this:*

if (vc->state == MPIDI_VC_STATE_ACTIVE ||
    vc->state == MPIDI_VC_STATE_LOCAL_ACTIVE ||
    vc->state == MPIDI_VC_STATE_REMOTE_CLOSE)
#endif
{
#ifdef _ENABLE_XRC_
    PRINT_DEBUG(DEBUG_XRC_verbose>0, "SendClose2 %d 0x%08x %d\n",
                vc->pg_rank, vc->ch.xrc_flags, vc->ch.state);
#endif
    MPIDI_CH3U_VC_SendClose( vc, i );
}
else
{
    if (vc->pg != NULL)
    {
        MPIDI_PG_release_ref(vc->pg, &inuse);
        if (inuse == 0)
        {
            MPIDI_PG_Destroy(vc->pg);
        }
    }

Checking that vc->pg is not NULL before it is released and destroyed
prevents the segfault that occasionally occurs when vc->state shows an
"inactive" state at the check above.
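To illustrate the guard in isolation, here is a minimal, self-contained
sketch with stand-in types (this is not MVAPICH code; pg_t, vc_t, and the
helper functions are simplified placeholders for the real structures and the
MPIDI_PG_* routines):

#include <stdio.h>
#include <stdlib.h>

typedef struct { int ref_count; } pg_t;   /* stand-in for the process group */
typedef struct { pg_t *pg; } vc_t;        /* stand-in for the virtual connection */

static void pg_release_ref(pg_t *pg, int *inuse)
{
    pg->ref_count--;
    *inuse = (pg->ref_count > 0);
}

static void pg_destroy(pg_t *pg) { free(pg); }

static void vc_release(vc_t *vc)
{
    int inuse;
    if (vc->pg != NULL) {            /* the added NULL guard */
        pg_release_ref(vc->pg, &inuse);
        if (inuse == 0)
            pg_destroy(vc->pg);
        vc->pg = NULL;
    }
    /* without the guard, a vc whose pg was never set (the "NULL process
     * group" case) would be dereferenced here and segfault */
}

int main(void)
{
    vc_t vc = { NULL };              /* mimics the NULL process group case */
    vc_release(&vc);                 /* safe: the guard skips the release */
    printf("no segfault\n");
    return 0;
}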

Thank you for your time and patience.

Respectfully,
Neil Spruit

On Thu, Feb 26, 2015 at 6:58 AM, Hari Subramoni <subramoni.1 at osu.edu> wrote:

> Hello Neil,
>
> Apologies for the delay in responding. The patch you sent looks good. I
> don't see any issues with it. However, it would be good if you could send us
> the reproducer so that we can try to reproduce it locally and identify the
> root cause.
>
> Regards,
> Hari.
>
> On Tue, Feb 24, 2015 at 2:46 PM, Neil Spruit <nrspruit at gmail.com> wrote:
>
>> Hello,
>>
>> I have been working on a host/client setup with MVAPICH and I have run
>> into an issue that comes up when performing connect/accept between two
>> MPI processes, specifically when using MPI_THREAD_MULTIPLE.
>>
>> The scenario is that I have one process on my host system connecting to an
>> MPI process launched separately on a remote machine. Most of the time this
>> connection succeeds. However, some of the time a segmentation fault occurs
>> within MVAPICH during the connection. Specifically, the error is tied to a
>> change of *vc->state* from active to inactive, which causes the code to
>> attempt to release *vc->pg*, which is NULL; there is no check in the code
>> to prevent this use of a NULL pointer, hence the segfault.
>>
>> The issue occurs in /src/mpid/ch3/src/mpid_vc.c, lines 275-292:
>>
>> if (vc->state == MPIDI_VC_STATE_ACTIVE ||
>>     vc->state == MPIDI_VC_STATE_LOCAL_ACTIVE ||
>>     vc->state == MPIDI_VC_STATE_REMOTE_CLOSE)
>> #endif
>> {
>> #ifdef _ENABLE_XRC_
>>     PRINT_DEBUG(DEBUG_XRC_verbose>0, "SendClose2 %d 0x%08x %d\n",
>>                 vc->pg_rank, vc->ch.xrc_flags, vc->ch.state);
>> #endif
>>     MPIDI_CH3U_VC_SendClose( vc, i );
>> }
>> else
>> {
>>     MPIDI_PG_release_ref(vc->pg, &inuse);
>>     if (inuse == 0)
>>     {
>>         MPIDI_PG_Destroy(vc->pg);
>>     }
>>
>> In the above section of code, vc->state goes from MPIDI_VC_STATE_ACTIVE to
>> MPIDI_VC_STATE_INACTIVE during my program. This causes the if check to fall
>> through to the else branch, where MPIDI_PG_release_ref attempts to use
>> vc->pg, which is always NULL in my case (I noticed that most code paths set
>> pg to NULL, except for a very few cases).
>>
>> This does not happen every time; I see it on roughly 1 in 10 connection
>> calls. It also fails only in the MPI_Comm_connect call, with the backtrace
>> below.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>>
>> [Switching to Thread 0x7f53a2336700 (LWP 5156)]
>>
>> 0x00007f53a58c67c7 in MPID_VCRT_Release () from
>> /usr/local/lib/libmpich.so.12
>>
>> (gdb) bt
>>
>> #0  0x00007f53a58c67c7 in MPID_VCRT_Release () from
>> /usr/local/lib/libmpich.so.12
>>
>> #1  0x00007f53a5884103 in MPIR_Comm_delete_internal () from
>> /usr/local/lib/libmpich.so.12
>>
>> #2  0x00007f53a58a5e0c in MPIDI_Comm_connect () from
>> /usr/local/lib/libmpich.so.12
>>
>> #3  0x00007f53a58c20f3 in MPID_Comm_connect () from
>> /usr/local/lib/libmpich.so.12
>>
>> #4  0x00007f53a5b52956 in PMPI_Comm_connect () from
>> /usr/local/lib/libmpich.so.12
>>
>> #5  0x00007f53a6564255 in _MPIComm::Connect (this=0x7f5398004010,
>>
>>     address=<value optimized out>, port=<value optimized out>
>>
>>
>> Just for the sake of trying a quick fix, I locally added a check in this
>> code to verify that vc->pg was not NULL before attempting to use it. The
>> segfault went away and I have not seen any issues or errors in the
>> operation of the MPI code since that change; however, that fixes a symptom,
>> not the root of the issue.
>>
>>
>> The question I have is: why would this connection go from active to
>> inactive and then get into this state?
>>
>>
>> In all cases I see this sequence of events (from prints I added to the code):
>>
>>
>> I just set a NULL process group for vc 0x7f7920000940
>>
>> setting vc 0x7f7920000940 active
>>
>> vc 0x7f7920000940 vc->state is 1
>>
>> vc 0x7f7920000940 vc->state is 1
>>
>> NULL process group
>>
>>
>> This implies that the state should have been set to active, but somewhere
>> else in the code it went to inactive before the state check in this code.
>>
>>
>> I am working on a reliable reproducer, but it has been tricky since it is
>> most definitely a timing issue.
>>
>>
>> Please let me know what you can find out about this situation. I am using
>> mvapich2-2.0.1, but I have also seen this issue in mvapich2-2.1rc1.
>>
>>
>> Thank you very much for your time.
>>
>>
>> Respectfully,
>>
>> Neil Spruit
>>
>>
>>
>>
>>
>>
>