[mvapich-discuss] dreg.c assertion failure
Martin Pokorny
mpokorny at nrao.edu
Wed Apr 2 10:51:21 EDT 2014
Hello, Hari.
On 03/31/2014 07:03 PM, Hari Subramoni wrote:
> Could you please provide us a reproducer so that we can debug it
> locally?
That is, unfortunately, not possible. These errors appear relatively
rarely in an event-driven program that is part of a complex real-time
system, which cannot easily be replicated. I wish I could find some
other program that would reproduce the error, but I don't have such a
program at this time.
> We recently released MVAPICH2-2.0rc1 which includes several
> bug-fixes and performance improvements. Could you also try with that and
> see if the error goes away?
I will try that at the next opportunity. However, that opportunity may
not arise for several days, and even then, given the infrequency of the
error, it could be a while before I can determine whether the error no
longer occurs. Have there been any bug fixes since 1.9a2 that might
plausibly resolve an error at least similar to the one I've encountered?
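For readers following along: dreg.c implements MVAPICH2's dynamic
registration cache for InfiniBand memory regions, and the failing check
guards entry validity during deregistration. The general pattern looks
roughly like the following minimal sketch (all names here are
illustrative, not the actual MVAPICH2 source):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical registration-cache entry, loosely modeled on the idea
 * behind MVAPICH2's dreg.c ("dreg" = dynamic registration).  These
 * names are invented for illustration only. */
struct dreg_entry {
    void  *addr;      /* start of the registered memory region */
    size_t len;       /* length of the region */
    int    is_valid;  /* nonzero while the registration is live */
};

/* Mark an entry invalid before its memory region is released. */
void dreg_invalidate(struct dreg_entry *d)
{
    d->is_valid = 0;
}

/* Sketch of the kind of check that produces the reported failure:
 * after a flush path has deregistered an entry, the entry must no
 * longer be marked valid.  An entry still valid at this point
 * (e.g., due to a missed invalidation on a free path) would trip
 * an assertion like the one at dreg.c line 899. */
void dreg_flush_check(const struct dreg_entry *d)
{
    assert(d->is_valid == 0);
}
```

The reported trace (free -> MPL_trfree -> flush of deregistered memory
regions -> assertion) is consistent with a cache entry that was expected
to be invalidated before the flush but was still marked valid.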
> On Mon, Mar 31, 2014 at 10:52 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:
>
> Hello, all.
>
> We've been running a locally built version of mvapich2-1.9a2 for
> about one year now, and on rare occasions we see the following error:
>
>     Assertion failed in file
>     src/mpid/ch3/channels/common/src/reg_cache/dreg.c at line 899:
>     d->is_valid == 0
>
>
>     Stack traces for these errors show various paths to the assertion
>     failure, at least in the application code (which I've been trying to
>     rule out as the cause, however unlikely that is for an assertion
>     failure). Here's a bit of the stack trace for a typical error:
>
>     [cbe-node-08:mpi_rank_6][print_backtrace] 0:
>     /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(print_backtrace+0x1e)
>     [0x7fed12b5d8ce]
>     [cbe-node-08:mpi_rank_6][print_backtrace] 1:
>     /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIDI_CH3_Abort+0x6f)
>     [0x7fed12b107ff]
>     [cbe-node-08:mpi_rank_6][print_backtrace] 2:
>     /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPID_Abort+0x7f)
>     [0x7fed12af3adf]
>     [cbe-node-08:mpi_rank_6][print_backtrace] 3:
>     /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIR_Assert_fail+0xa2)
>     [0x7fed12abceb2]
>     [cbe-node-08:mpi_rank_6][print_backtrace] 4:
>     /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(flush_dereg_mrs_external+0x290)
>     [0x7fed12b2df80]
>     [cbe-node-08:mpi_rank_6][print_backtrace] 5:
>     /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(free+0xda)
>     [0x7fed12b5552b]
>     [cbe-node-08:mpi_rank_6][print_backtrace] 6:
>     /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpl.so.1(MPL_trfree+0x4aa)
>     [0x7fed126188da]
>
>
> The rate of failure is rather low -- I hadn't seen such an error in
> 2-3 months prior to an event this weekend -- but the impact of these
> errors can be significant. Is there any further information I can
> provide to help diagnose the cause? I'm willing to rebuild mvapich2
> with various options or to try a newer version.
>
> --
> Martin Pokorny
> Software Engineer - New Mexico Systems Group lead
> National Radio Astronomy Observatory - New Mexico Operations
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
--
Martin Pokorny
Software Engineer - New Mexico Systems Group lead
National Radio Astronomy Observatory - New Mexico Operations