[mvapich-discuss] bus error in MPIDI_CH3I_CM_SHMEM_Sync
Subramoni, Hari
subramoni.1 at osu.edu
Fri Dec 4 17:29:19 EST 2020
Hi, Lana.
Sorry to hear that you’re facing issues. If possible, could you please try out your program with the new MVAPICH2 2.3.5 release we made a few days ago and see if it resolves your issues?
Best,
Hari.
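
A note for anyone trying this: with a failure rate of about 1%, a handful of clean runs on the new release says very little. The sketch below estimates how many consecutive clean runs are needed before a fix is plausible. It assumes the runs are independent and that the old failure rate really is a constant 1%; neither is established in this thread. The function name and the confidence levels are illustrative only.

```python
import math


def clean_runs_needed(failure_rate, confidence=0.95):
    """Smallest n such that n clean runs in a row would occur with
    probability below (1 - confidence) if the bug were still present.

    Assumes independent runs with a constant per-run failure rate.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


print(clean_runs_needed(0.01))        # 95% confidence
print(clean_runs_needed(0.01, 0.99))  # 99% confidence
```

Under those assumptions, roughly 300 clean runs are needed at 95% confidence, which is about the same number of runs it took to observe the problem in the first place.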
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Lana Deere
Sent: Friday, December 4, 2020 4:23 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] bus error in MPIDI_CH3I_CM_SHMEM_Sync
I'm having a rarely-occurring problem using mvapich2 2.3.4 GA on a CentOS7 cluster.
I've run our proprietary program using the same input dataset on the same cluster several hundred times, and about 1% of the time the run crashes with a bus error and a traceback that looks like this:
...ty/lib/libmpi.so.12 MPIDI_CH3I_CM_SHMEM_Sync
...ty/lib/libmpi.so.12 MPIDI_CH3I_SMP_init
...ty/lib/libmpi.so.12 MPIDI_CH3_Init
...ty/lib/libmpi.so.12 MPID_Init
...ty/lib/libmpi.so.12 MPIR_Init_thread
...ty/lib/libmpi.so.12 MPI_Init_thread
I'm not sure where it is exactly inside MPIDI_CH3I_CM_SHMEM_Sync.
The process which gets the bus error is always a child subprocess created using MPI_Comm_spawn. The rest of the child subprocesses are hung somewhere in MPI_Init_thread (or its subfunctions) and the parent processes are all hung somewhere in MPI_Comm_spawn (or its subfunctions).
Has anyone seen anything like this before? Does anyone have any suggestions on how to try debugging it? I see some PRINT_DEBUG statements in the function but I don't know how to turn them on.
.. Lana (lana.deere at gmail.com)
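
Since the failure shows up in only about 1% of runs, one way to narrow it down is to automate the repeated runs and tally how each one ends. The following is a minimal sketch of such a harness, not something taken from MVAPICH2 itself. The function name and the timeout handling are assumptions for illustration. Two caveats: when the program is started through an MPI launcher, a bus error in a spawned child will most likely surface as a non-zero exit status of the launcher rather than as a signal on the launcher itself, and on a timeout only the directly started process is killed, so hung child processes may need to be cleaned up separately.

```python
import signal
import subprocess
from collections import Counter


def run_repeatedly(cmd, runs, timeout_s=None):
    """Run `cmd` (a list of arguments) `runs` times and count the outcomes.

    Outcomes are keyed as "ok", "timeout", "exit <code>", or the signal
    name (for example "SIGBUS") when the process was killed by a signal.
    """
    outcomes = Counter()
    for _ in range(runs):
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A hang (as described for the parent processes) lands here.
            outcomes["timeout"] += 1
            continue

        rc = proc.returncode
        if rc == 0:
            outcomes["ok"] += 1
        elif rc < 0:
            # On POSIX, a negative return code means "killed by signal -rc".
            try:
                outcomes[signal.Signals(-rc).name] += 1
            except ValueError:
                outcomes[f"signal {-rc}"] += 1
        else:
            outcomes[f"exit {rc}"] += 1
    return outcomes
```

A call such as `run_repeatedly(["mpirun", "-np", "4", "./your_program"], runs=300, timeout_s=600)` would then give a tally of clean runs, hangs, and crashes. The launcher command line shown here is a placeholder and has to be replaced with whatever is actually used on the cluster.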