[mvapich-discuss] bus error in MPIDI_CH3I_CM_SHMEM_Sync
Lana Deere
lana.deere at gmail.com
Fri Dec 4 16:22:45 EST 2020
I'm having a rarely occurring problem using mvapich2 2.3.4 GA on a CentOS 7
cluster.
I've run our proprietary program using the same input dataset on the same
cluster several hundred times, and about 1% of the time the run crashes
with a bus error and a traceback which looks like this:
...ty/lib/libmpi.so.12 MPIDI_CH3I_CM_SHMEM_Sync
...ty/lib/libmpi.so.12 MPIDI_CH3I_SMP_init
...ty/lib/libmpi.so.12 MPIDI_CH3_Init
...ty/lib/libmpi.so.12 MPID_Init
...ty/lib/libmpi.so.12 MPIR_Init_thread
...ty/lib/libmpi.so.12 MPI_Init_thread
I'm not sure exactly where inside MPIDI_CH3I_CM_SHMEM_Sync the crash occurs.
The process which gets the bus error is always a child subprocess created
using MPI_Comm_spawn. The rest of the child subprocesses are hung
somewhere in MPI_Init_thread (or its subfunctions) and the parent
processes are all hung somewhere in MPI_Comm_spawn (or its subfunctions).
Has anyone seen anything like this before? Does anyone have any
suggestions on how to try debugging it? I see some PRINT_DEBUG statements
in the function but I don't know how to turn them on.
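
One generic way to narrow down the faulting location, independent of the
library's own debug output, is a SIGBUS handler that prints the faulting
address and a raw backtrace. The following is only a minimal sketch under
stated assumptions: it relies on glibc's backtrace() facility, the names
install_sigbus_backtrace, sigbus_backtrace_handler and write_hex are
hypothetical helpers, and it assumes the MPI library does not replace the
handler during initialization.

```c
#define _GNU_SOURCE
#include <execinfo.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write an unsigned value as 0x... hex using only
 * write(), which is async-signal-safe (printf is not). */
static void write_hex(uintptr_t v)
{
    char buf[2 + 2 * sizeof v];
    int i = (int)sizeof buf;
    ssize_t r;

    do {
        buf[--i] = "0123456789abcdef"[v & 0xf];
        v >>= 4;
    } while (v != 0 && i > 2);
    buf[--i] = 'x';
    buf[--i] = '0';
    r = write(STDERR_FILENO, buf + i, sizeof buf - (size_t)i);
    (void)r;
}

/* Hypothetical handler: report the faulting address and the call stack,
 * then exit without returning into the faulting instruction. */
static void sigbus_backtrace_handler(int sig, siginfo_t *info, void *ctx)
{
    static const char head[] = "SIGBUS at address ";
    static const char tail[] = ", backtrace:\n";
    void *frames[64];
    int n;
    ssize_t r;

    (void)ctx;
    r = write(STDERR_FILENO, head, sizeof head - 1);
    write_hex((uintptr_t)info->si_addr);
    r = write(STDERR_FILENO, tail, sizeof tail - 1);
    (void)r;

    n = backtrace(frames, 64);
    /* Writes one line per frame straight to the fd, without malloc. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + sig);
}

/* Hypothetical installer: returns 0 on success, -1 on failure. */
int install_sigbus_backtrace(void)
{
    struct sigaction sa;
    void *warm[1];

    /* Call backtrace() once up front so any lazy loading it triggers
     * happens here rather than inside the signal handler. */
    backtrace(warm, 1);

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigbus_backtrace_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

The installer would be called at the very start of main() in the spawned
child program, before MPI_Init_thread. Each backtrace line has the form
library(function+offset) [address], so the offset inside
MPIDI_CH3I_CM_SHMEM_Sync can be resolved to a source line afterwards with a
tool such as addr2line, provided the library was built with debug
information.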
.. Lana (lana.deere at gmail.com)