[mvapich-discuss] dreg.c assertion failure

Martin Pokorny mpokorny at nrao.edu
Wed Apr 2 10:51:21 EDT 2014


Hello, Hari.

On 03/31/2014 07:03 PM, Hari Subramoni wrote:
> Could you please provide us a reproducer so that we can debug it
> locally?

That is, unfortunately, not possible. These errors appear relatively
rarely in an event-driven program that is part of a complex real-time
system, which cannot be easily replicated. I wish that I could find some
other program that would reproduce the error, but I don't have such a
program at this time.

> We recently released MVAPICH2-2.0rc1 which includes several
> bug-fixes and performance improvements. Could you also try with that and
> see if the error goes away?

I will try that at the next opportunity. However, that opportunity may
not appear for several days, and even then, given the infrequency of the
error, it could be a while before I can determine whether the error no
longer occurs. Have there been any bug fixes since 1.9a2 that might
plausibly resolve an error similar to the one I've encountered?


> On Mon, Mar 31, 2014 at 10:52 AM, Martin Pokorny <mpokorny at nrao.edu>
> wrote:
>
>     Hello, all.
>
>     We've been running a locally built version of mvapich2-1.9a2 for
>     about one year now, and on rare occasions we see the following error:
>
>         Assertion failed in file
>         src/mpid/ch3/channels/common/src/reg_cache/dreg.c at line 899:
>         d->is_valid == 0
>
>
>     Stack traces for these errors show various paths to the assertion
>     failure, at least in the application code (which I've been trying to
>     rule out as the cause, however unlikely for an assertion failure.)
>     Here's a bit of the stack trace for a typical error:
>
>         [cbe-node-08:mpi_rank_6][print_backtrace]   0:
>         /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(print_backtrace+0x1e)
>         [0x7fed12b5d8ce]
>         [cbe-node-08:mpi_rank_6][print_backtrace]   1:
>         /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIDI_CH3_Abort+0x6f)
>         [0x7fed12b107ff]
>         [cbe-node-08:mpi_rank_6][print_backtrace]   2:
>         /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPID_Abort+0x7f)
>         [0x7fed12af3adf]
>         [cbe-node-08:mpi_rank_6][print_backtrace]   3:
>         /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIR_Assert_fail+0xa2)
>         [0x7fed12abceb2]
>         [cbe-node-08:mpi_rank_6][print_backtrace]   4:
>         /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(flush_dereg_mrs_external+0x290)
>         [0x7fed12b2df80]
>         [cbe-node-08:mpi_rank_6][print_backtrace]   5:
>         /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(free+0xda)
>         [0x7fed12b5552b]
>         [cbe-node-08:mpi_rank_6][print_backtrace]   6:
>         /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpl.so.1(MPL_trfree+0x4aa)
>         [0x7fed126188da]
>
>
>     The rate of failure is rather low -- I hadn't seen such an error in
>     2-3 months prior to an event this weekend -- but the impact of these
>     errors can be significant. Is there any further information I can
>     provide to help diagnose the cause? I'm willing to rebuild mvapich2
>     with various options or to try a newer version.
>
>     --
>     Martin Pokorny
>     Software Engineer - New Mexico Systems Group lead
>     National Radio Astronomy Observatory - New Mexico Operations


-- 
Martin Pokorny
Software Engineer - New Mexico Systems Group lead
National Radio Astronomy Observatory - New Mexico Operations


