[mvapich-discuss] dreg.c assertion failure

Hari Subramoni subramoni.1 at osu.edu
Mon Mar 31 21:03:53 EDT 2014


Hello Martin,

Could you please provide us with a reproducer so that we can debug it locally?
We recently released MVAPICH2-2.0rc1, which includes several bug fixes and
performance improvements. Could you also try that version and see whether the
error goes away?

Regards,
Hari.


On Mon, Mar 31, 2014 at 10:52 AM, Martin Pokorny <mpokorny at nrao.edu> wrote:

> Hello, all.
>
> We've been running a locally built version of mvapich2-1.9a2 for about one
> year now, and on rare occasions we see the following error:
>
>> Assertion failed in file src/mpid/ch3/channels/common/src/reg_cache/dreg.c
>> at line 899: d->is_valid == 0
>
> Stack traces for these errors show various paths to the assertion failure,
> at least within the application code (which I've been trying to rule out as
> the cause, unlikely as that is for an assertion failure). Here's a bit of
> the stack trace for a typical error:
>
>> [cbe-node-08:mpi_rank_6][print_backtrace]   0:
>> /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(print_backtrace+0x1e)
>> [0x7fed12b5d8ce]
>> [cbe-node-08:mpi_rank_6][print_backtrace]   1:
>> /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIDI_CH3_Abort+0x6f)
>> [0x7fed12b107ff]
>> [cbe-node-08:mpi_rank_6][print_backtrace]   2:
>> /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPID_Abort+0x7f)
>> [0x7fed12af3adf]
>> [cbe-node-08:mpi_rank_6][print_backtrace]   3:
>> /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIR_Assert_fail+0xa2)
>> [0x7fed12abceb2]
>> [cbe-node-08:mpi_rank_6][print_backtrace]   4:
>> /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(flush_dereg_mrs_external+0x290)
>> [0x7fed12b2df80]
>> [cbe-node-08:mpi_rank_6][print_backtrace]   5:
>> /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(free+0xda)
>> [0x7fed12b5552b]
>> [cbe-node-08:mpi_rank_6][print_backtrace]   6:
>> /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpl.so.1(MPL_trfree+0x4aa)
>> [0x7fed126188da]
>
> The rate of failure is rather low -- I hadn't seen such an error in 2-3
> months prior to an event this weekend -- but the impact of these errors can
> be significant. Is there any further information I can provide to help
> diagnose the cause? I'm willing to rebuild mvapich2 with various options or
> to try a newer version.
>
> --
> Martin Pokorny
> Software Engineer - New Mexico Systems Group lead
> National Radio Astronomy Observatory - New Mexico Operations