[Mvapich-discuss] Bus error in MPI_Init_thread in subprocess
Lana Deere
lana.deere at gmail.com
Fri Jan 8 12:43:12 EST 2021
I am getting intermittent Bus Errors inside MPI_Init_thread when it is
called from spawned subprocesses. I first saw this in 2.3.1, and upgrading
to 2.3.4 did not help. I am now running 2.3.5-1 with MV2_ENABLE_AFFINITY=0
and still see the problem. Interestingly, if I don't set
MV2_ENABLE_AFFINITY=0 the problem seems to go away, but leaving affinity
enabled cripples my performance, so that's not a useful solution. Perhaps
there is a race condition inside the MPI_Init_thread code which I am
hitting erratically?
The program's N parent processes call MPI_Comm_spawn to create N child
processes (i.e., 1 each), and intermittently one of the child processes gets
a Bus Error inside MPI_Init_thread. The stack for a recent example was:
0x1b0272b (no module or function available)
libpthread.so.0 (function not available)
MPIDI_CH3I_CM_SHMEM_Sync
MPIDI_CH3I_SMP_init
MPIDI_CH3_Init
MPID_Init
MPIR_Init_thread
MPI_Init_thread
The other processes were all hanging: 8 parents in MPI_Comm_spawn and 7
children in MPI_Init_thread. In case it's helpful, here are the stack
traces for the remaining processes in more detail:
2x parent processes on worker10:
mlx5_poll_cq_v1
MPIDI_CH3I_MRAILI_Cq_poll_ib
MPIDI_CH3I_read_progress
MPIDI_CH3I_Progress
MPIDI_Comm_accept
MPID_Comm_accept
MPIDI_Comm_spawn_multiple
PMPI_Comm_spawn
1x parent on worker7, 2x parents on worker12, 1x parent on worker3:
MPIDI_CH3I_MRAILI_Cq_poll_ib
MPIDI_CH3I_read_progress
MPIDI_CH3I_Progress
MPIR_Bcast_binomial
MPIR_Bcast_intra
MPIR_Bcast_index_tuned_intra_MV2
MPIR_Bcast_MV2
MPIR_Bcast_intra
MPIDI_Comm_accept
MPID_Comm_accept
MPIDI_Comm_spawn_multiple
PMPI_Comm_spawn
1x parent on worker7, 1x parent on worker3:
MPIDI_CH3I_SMP_pull_header
MPIDI_CH3I_SMP_read_progress
MPIDI_CH3I_Progress
MPIR_Bcast_binomial
MPIR_Bcast_intra
MPIR_Bcast_index_tuned_intra_MV2
MPIR_Bcast_MV2
MPIR_Bcast_intra
MPIDI_Comm_accept
MPID_Comm_accept
MPIDI_Comm_spawn_multiple
PMPI_Comm_spawn
2x children on worker10, 1x child on worker7, 2x children on worker12, 1x
child on worker3:
MPIDI_CH3I_MRAILI_Cq_poll_ib
MPIDI_CH3I_read_progress
MPIDI_CH3I_Progress
MPIR_Allreduce_pt2pt_rd_MV2
MPIR_Allreduce_index_tuned_intra_MV2
MPIR_Allreduce_impl
MPIR_Get_contextid_sparse_group
MPIDI_Comm_connect
MPID_Comm_connect
MPID_Init
MPIR_Init_thread
PMPI_Init_thread
1x child on worker7:
MPIDI_CH3I_SMP_write_progress
MPIDI_CH3I_Progress
MPIR_Allreduce_pt2pt_rd_MV2
MPIR_Allreduce_index_tuned_intra_MV2
MPIR_Allreduce_impl
MPIR_Get_contextid_sparse_group
MPIDI_Comm_connect
MPID_Comm_connect
MPID_Init
MPIR_Init_thread
PMPI_Init_thread
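
For anyone trying to reproduce this, the spawn pattern described above can
be sketched roughly as follows. This is a minimal illustration, not the
actual application code. It assumes a single collective MPI_Comm_spawn over
MPI_COMM_WORLD with maxprocs equal to the number of parents (which would be
consistent with the MPIR_Bcast frames inside MPIDI_Comm_accept in the parent
stacks), it assumes MPI_THREAD_MULTIPLE as the requested thread level, and
it assumes the same binary is reused for the children via argv[0]. None of
those details are confirmed in the report.

```c
/* Sketch only: one binary acts as parent or child, depending on whether
 * it was started by MPI_Comm_spawn. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = 0, rank = 0, size = 0;
    MPI_Comm parent = MPI_COMM_NULL;

    /* Children arrive here after being spawned; this is the call in which
     * the intermittent Bus Error was observed. The requested thread level
     * is an assumption. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: N parents collectively spawn N children
         * (assumed layout: one collective call, maxprocs = N). */
        MPI_Comm children = MPI_COMM_NULL;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, size, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
        printf("parent %d of %d: spawn returned\n", rank, size);
        MPI_Comm_disconnect(&children);
    } else {
        /* Child side: MPI_Init_thread has already completed here. */
        printf("child %d of %d: init done (provided=%d)\n",
               rank, size, provided);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```

Because the failure is intermittent, a sketch like this would presumably need
to be launched repeatedly, with MV2_ENABLE_AFFINITY=0 set and two parents per
node as in the traces above, before a child hits the error.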
.. Lana (lana.deere at gmail.com)