[mvapich-discuss] Segfault seen during connect/accept with dynamic processes

Hari Subramoni subramoni.1 at osu.edu
Thu Feb 26 09:58:40 EST 2015


Hello Neil,

Apologies for the delay in responding. The patch you sent looks good. I
don't see any issues with it. However, it would be good if you could send us
the reproducer so that we can try to reproduce it locally and identify the
root cause.

Regards,
Hari.

On Tue, Feb 24, 2015 at 2:46 PM, Neil Spruit <nrspruit at gmail.com> wrote:

> Hello,
>
> I have been working on a Host/Client setup with mvapich and have run into
> an issue that comes up when performing connect/accept between two MPI
> processes, specifically when using MPI_THREAD_MULTIPLE.
>
> The scenario: one process on my Host system connects to an MPI process
> launched separately on a remote machine. Most of the time the connection
> succeeds, but occasionally a segmentation fault occurs inside mvapich
> during the connect. Specifically, the fault is tied to vc->state changing
> from active to inactive, which sends the code down a path that releases
> vc->pg; in my case vc->pg is NULL, and there is no check in the code to
> guard against freeing a NULL process group, hence the segfault.
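>
> Roughly, the connect/accept pattern in use looks like the sketch below
> (illustrative only, not my actual application code; the real setup passes
> the port name out of band and repeats the connect many times):
>
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
>
> /* Run one copy as "server" and one as "client <port-name>"; the server
>  * prints the port name, which is handed to the client out of band. */
> int main(int argc, char **argv)
> {
>     int provided;
>     char port[MPI_MAX_PORT_NAME];
>     MPI_Comm inter = MPI_COMM_NULL;
>
>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>
>     if (argc > 1 && strcmp(argv[1], "server") == 0) {
>         /* server side: open a port, print it, and wait for the client */
>         MPI_Open_port(MPI_INFO_NULL, port);
>         printf("port: %s\n", port);
>         fflush(stdout);
>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>         MPI_Close_port(port);
>     } else if (argc > 2 && strcmp(argv[1], "client") == 0) {
>         /* client side: connect to the port name given on the command line */
>         MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>     }
>
>     if (inter != MPI_COMM_NULL)
>         MPI_Comm_disconnect(&inter);
>     MPI_Finalize();
>     return 0;
> }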
>
> The issue occurs in /src/mpid/ch3/src/mpid_vc.c, lines 275-292:
>
> if (vc->state == MPIDI_VC_STATE_ACTIVE ||
>     vc->state == MPIDI_VC_STATE_LOCAL_ACTIVE ||
>     vc->state == MPIDI_VC_STATE_REMOTE_CLOSE)
> #endif
> {
> #ifdef _ENABLE_XRC_
>     PRINT_DEBUG(DEBUG_XRC_verbose>0, "SendClose2 %d 0x%08x %d\n",
>                 vc->pg_rank, vc->ch.xrc_flags, vc->ch.state);
> #endif
>     MPIDI_CH3U_VC_SendClose( vc, i );
> }
> else
> {
>     MPIDI_PG_release_ref(vc->pg, &inuse);
>     if (inuse == 0)
>     {
>         MPIDI_PG_Destroy(vc->pg);
>     }
>
> In the above section of code, vc->state goes from MPIDI_VC_STATE_ACTIVE to
> MPIDI_VC_STATE_INACTIVE during my program's run. The if check therefore
> falls through to the else branch, where MPIDI_PG_release_ref tries to use
> vc->pg, which is always NULL in my case (I noticed that most places that
> set this up set pg to NULL, with only a few exceptions).
>
> This does not happen every time; roughly 1 in 10 connection calls hit it.
> Also, the failure only occurs in the MPI_Comm_connect call, with the
> backtrace below.
>
> Program received signal SIGSEGV, Segmentation fault.
>
> [Switching to Thread 0x7f53a2336700 (LWP 5156)]
>
> 0x00007f53a58c67c7 in MPID_VCRT_Release () from
> /usr/local/lib/libmpich.so.12
>
> (gdb) bt
>
> #0  0x00007f53a58c67c7 in MPID_VCRT_Release () from
> /usr/local/lib/libmpich.so.12
>
> #1  0x00007f53a5884103 in MPIR_Comm_delete_internal () from
> /usr/local/lib/libmpich.so.12
>
> #2  0x00007f53a58a5e0c in MPIDI_Comm_connect () from
> /usr/local/lib/libmpich.so.12
>
> #3  0x00007f53a58c20f3 in MPID_Comm_connect () from
> /usr/local/lib/libmpich.so.12
>
> #4  0x00007f53a5b52956 in PMPI_Comm_connect () from
> /usr/local/lib/libmpich.so.12
>
> #5  0x00007f53a6564255 in _MPIComm::Connect (this=0x7f5398004010,
>
>     address=<value optimized out>, port=<value optimized out>
>
>
> As a quick local fix, I added a check that vc->pg is not NULL before
> attempting to use it. The segfault went away, and I have not seen any
> issues or errors in the operation of the MPI code since that change;
> however, that fixes a symptom, not the root of the issue.
>
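> The change amounts to guarding the release with a NULL check, roughly like
> the following (a minimal sketch against the excerpt above; the actual
> change I applied may differ slightly):
>
> else
> {
>     /* vc->pg is NULL on this path in my runs, so skip the release rather
>      * than dereferencing a NULL process group. */
>     if (vc->pg != NULL)
>     {
>         MPIDI_PG_release_ref(vc->pg, &inuse);
>         if (inuse == 0)
>         {
>             MPIDI_PG_Destroy(vc->pg);
>         }
>     }
> }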
>
> The question I have is: why would this connection go from active to
> inactive and then end up in this state?
>
>
> In all cases I see the following sequence of events (from prints I added
> to the code):
>
>
> I just set a NULL process group for vc 0x7f7920000940
>
> setting vc 0x7f7920000940 active
>
> vc 0x7f7920000940 vc->state is 1
>
> vc 0x7f7920000940 vc->state is 1
>
> NULL process group
>
>
> This implies that the state was set to active, but that somewhere else in
> the code it went back to inactive before the state check shown above.
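>
>
> (For reference, the prints above came from instrumentation roughly like
> the following, placed where vc->pg and vc->state are assigned and just
> before the state check; the exact placement and wording here are
> illustrative.)
>
> /* where the VC is created without a process group */
> vc->pg = NULL;
> fprintf(stderr, "I just set a NULL process group for vc %p\n", (void *) vc);
>
> /* where the VC is marked active, and again just before the state check in
>  * the excerpt above */
> fprintf(stderr, "vc %p vc->state is %d\n", (void *) vc, vc->state);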
>
>
> I am working on a reliable reproducer, but it has been tricky since it is
> most definitely a timing issue.
>
>
> Please let me know what you can find out about this situation. I am using
> mvapich2-2.0.1, but I have also seen the issue in mvapich2-2.1rc1.
>
>
> Thank you very much for your time.
>
>
> Respectfully,
>
> Neil Spruit
>