[mvapich-discuss] dreg.c assertion failure

Martin Pokorny mpokorny at nrao.edu
Mon Mar 31 10:52:10 EDT 2014


Hello, all.

We've been running a locally built version of mvapich2-1.9a2 for about 
one year now, and on rare occasions we see the following error:

> Assertion failed in file src/mpid/ch3/channels/common/src/reg_cache/dreg.c at line 899: d->is_valid == 0

Stack traces for these errors show various paths to the assertion 
failure, at least within the application code (which I've been trying 
to rule out as the cause, unlikely though that is for an assertion 
failure). Here's part of the stack trace for a typical error:

> [cbe-node-08:mpi_rank_6][print_backtrace]   0: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(print_backtrace+0x1e) [0x7fed12b5d8ce]
> [cbe-node-08:mpi_rank_6][print_backtrace]   1: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIDI_CH3_Abort+0x6f) [0x7fed12b107ff]
> [cbe-node-08:mpi_rank_6][print_backtrace]   2: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPID_Abort+0x7f) [0x7fed12af3adf]
> [cbe-node-08:mpi_rank_6][print_backtrace]   3: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(MPIR_Assert_fail+0xa2) [0x7fed12abceb2]
> [cbe-node-08:mpi_rank_6][print_backtrace]   4: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(flush_dereg_mrs_external+0x290) [0x7fed12b2df80]
> [cbe-node-08:mpi_rank_6][print_backtrace]   5: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpich.so.8(free+0xda) [0x7fed12b5552b]
> [cbe-node-08:mpi_rank_6][print_backtrace]   6: /opt/cbe-local/stow/mvapich2-1.9a2-mp/lib/libmpl.so.1(MPL_trfree+0x4aa) [0x7fed126188da]

The rate of failure is rather low -- I hadn't seen such an error in the 
2-3 months before a failure this weekend -- but the impact of these 
errors can be significant. Is there any further information I can provide to 
help diagnose the cause? I'm willing to rebuild mvapich2 with various 
options or to try a newer version.

-- 
Martin Pokorny
Software Engineer - New Mexico Systems Group lead
National Radio Astronomy Observatory - New Mexico Operations


