[mvapich-discuss] bus error in MPIDI_CH3I_CM_SHMEM_Sync

Lana Deere lana.deere at gmail.com
Fri Dec 4 16:22:45 EST 2020


I'm having an intermittent problem using MVAPICH2 2.3.4 GA on a CentOS 7
cluster.

I've run our proprietary program using the same input dataset on the same
cluster several hundred times, and about 1% of the time the run crashes
with a bus error and a traceback that looks like this:
...ty/lib/libmpi.so.12 MPIDI_CH3I_CM_SHMEM_Sync
...ty/lib/libmpi.so.12 MPIDI_CH3I_SMP_init
...ty/lib/libmpi.so.12 MPIDI_CH3_Init
...ty/lib/libmpi.so.12 MPID_Init
...ty/lib/libmpi.so.12 MPIR_Init_thread
...ty/lib/libmpi.so.12 MPI_Init_thread
I'm not sure exactly where inside MPIDI_CH3I_CM_SHMEM_Sync the error occurs.
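
One possible way to narrow down the faulting location is a small SIGBUS
reporter installed in the child before it calls MPI_Init_thread. The sketch
below is only illustrative: the names install_sigbus_reporter and
sigbus_reporter are hypothetical, it assumes Linux with glibc's execinfo.h,
and it assumes (unverified) that MVAPICH2 does not replace the SIGBUS handler
during initialization. It prints si_code and the faulting address, which can
help distinguish an alignment fault from an access to a mapping with no
backing storage.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set to 1 once the reporter has run (hypothetical helper flag). */
static volatile sig_atomic_t sigbus_seen = 0;

/* Hypothetical reporter: prints si_code, the fault address and a raw
 * backtrace to stderr. snprintf is not formally async-signal-safe; it is
 * used here only because the process is about to die anyway. */
static void sigbus_reporter(int sig, siginfo_t *info, void *ucontext)
{
    char msg[128];
    void *frames[64];
    int len, depth;

    (void)sig;
    (void)ucontext;
    sigbus_seen = 1;

    /* si_code is e.g. BUS_ADRALN (alignment) or BUS_ADRERR (no such
     * physical address, typical for an mmap'd file with missing backing). */
    len = snprintf(msg, sizeof msg,
                   "SIGBUS in pid %ld: si_code=%d, fault address=%p\n",
                   (long)getpid(), info->si_code, info->si_addr);
    if (len > 0) {
        size_t n = (size_t)len < sizeof msg ? (size_t)len : sizeof msg - 1;
        ssize_t written = write(STDERR_FILENO, msg, n);
        (void)written;
    }

    depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    /* SA_RESETHAND restored the default action on entry, so returning
     * re-executes the faulting instruction and the process then dies
     * normally (with a core dump if core dumps are enabled). */
}

/* Returns 0 on success, -1 on failure (same convention as sigaction). */
int install_sigbus_reporter(void)
{
    void *warmup[1];
    struct sigaction sa;

    /* The first backtrace() call may load libgcc and allocate memory,
     * so make that call here rather than inside the handler. */
    backtrace(warmup, 1);

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigbus_reporter;
    sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

The intended use would be to call install_sigbus_reporter() as the first
statement of main() in the spawned program, before MPI_Init_thread. The raw
addresses printed by backtrace_symbols_fd can then be resolved against
libmpi.so.12 with a tool such as addr2line, provided the library was built
with debug information.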

The process which gets the bus error is always a child subprocess created
using MPI_Comm_spawn.  The rest of the child subprocesses are hung
somewhere in MPI_Init_thread (or its subfunctions) and the parent
processes are all hung somewhere in MPI_Comm_spawn (or its subfunctions).

Has anyone seen anything like this before?  Does anyone have any
suggestions on how to try debugging it?  I see some PRINT_DEBUG statements
in the function but I don't know how to turn them on.

.. Lana (lana.deere at gmail.com)