[mvapich-discuss] dreg.c assertion failure
Martin Pokorny
mpokorny at nrao.edu
Mon Mar 31 10:52:10 EDT 2014
Hello, all.
We've been running a locally built version of mvapich2-1.9a2 for about
one year now, and on rare occasions we see the following error:
> Assertion failed in file src/mpid/ch3/channels/common/src/reg_cache/dreg.c at line 899: d->is_valid == 0
Stack traces for these errors show various paths to the assertion
failure, at least within the application code. (I've been trying to
rule out the application as the cause, unlikely as that is for an
assertion failure inside the library.)
Here's a bit of the stack trace for a typical error:
> [cbe-node-08:mpi_rank_6][print_backtrace] 0: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(print_backtrace+0x1e) [0x7fed12b5d8ce]
> [cbe-node-08:mpi_rank_6][print_backtrace] 1: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIDI_CH3_Abort+0x6f) [0x7fed12b107ff]
> [cbe-node-08:mpi_rank_6][print_backtrace] 2: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPID_Abort+0x7f) [0x7fed12af3adf]
> [cbe-node-08:mpi_rank_6][print_backtrace] 3: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIR_Assert_fail+0xa2) [0x7fed12abceb2]
> [cbe-node-08:mpi_rank_6][print_backtrace] 4: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(flush_dereg_mrs_external+0x290) [0x7fed12b2df80]
> [cbe-node-08:mpi_rank_6][print_backtrace] 5: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(free+0xda) [0x7fed12b5552b]
> [cbe-node-08:mpi_rank_6][print_backtrace] 6: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpl.so.1(MPL_trfree+0x4aa) [0x7fed126188da]
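Since the traces reach the assertion by various paths, it can help to reduce each one to its sequence of symbols before comparing them. The following is only a minimal sketch, assuming the log lines follow exactly the print_backtrace format shown above; the names FRAME_RE, parse_frames and call_path are hypothetical helpers and not part of mvapich2.

```python
import re

# Assumed line format, taken from the trace above:
# "[host:mpi_rank_N][print_backtrace] I: /path/lib.so(symbol+0xOFF) [0xADDR]"
FRAME_RE = re.compile(
    r"\[(?P<host>[^:\]]+):mpi_rank_(?P<rank>\d+)\]\[print_backtrace\]\s+"
    r"(?P<index>\d+):\s+(?P<module>[^\s(]+)"
    r"\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\)"
)


def parse_frames(text):
    """Return (rank, frame index, symbol) for every backtrace line in text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)  # search() tolerates a leading "> " quote prefix
        if match:
            frames.append(
                (int(match.group("rank")), int(match.group("index")), match.group("symbol"))
            )
    return frames


def call_path(text):
    """Return symbols from outermost caller to innermost frame.

    Assumes text holds a single trace from a single rank; frame 0 is the
    innermost frame, so the frames are sorted by descending index.
    """
    frames = sorted(parse_frames(text), key=lambda frame: frame[1], reverse=True)
    return [symbol for _, _, symbol in frames]
```

Running call_path over each saved trace and comparing the resulting lists would show whether the different failures share the same final frames, for example free followed by flush_dereg_mrs_external as in the excerpt above.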
The rate of failure is rather low (before an event this weekend, I
hadn't seen such an error in 2-3 months), but the impact of these
errors can be significant. Is there any further information I can
provide to help diagnose the cause? I'm willing to rebuild mvapich2
with various options or to try a newer version.
--
Martin Pokorny
Software Engineer - New Mexico Systems Group lead
National Radio Astronomy Observatory - New Mexico Operations