[Mvapich-discuss] bus error in MPI_Init_thread in subprocess

Lana Deere lana.deere at gmail.com
Fri Jan 8 12:43:12 EST 2021


I am getting bus errors inside MPI_Init_thread when it is called from
spawned subprocesses.  I first saw this in MVAPICH2 2.3.1.  I upgraded to
2.3.4 and still saw the problem.  I am now running 2.3.5-1 with
MV2_ENABLE_AFFINITY=0 and it still occurs.  Interestingly, if I don't set
MV2_ENABLE_AFFINITY=0 the problem seems to go away, but that cripples
performance, so it is not a useful workaround.  Perhaps there is a race
condition inside the MPI_Init_thread code that I am hitting intermittently?

The program's N parent processes call MPI_Comm_spawn to start N child
processes (i.e., one per parent), and intermittently one of the child
processes gets a bus error inside MPI_Init_thread.  The stack from a recent
example was:
    0x1b0272b (no module or function available)
    libpthread.so.0 (function not available)
    MPIDI_CH3I_CM_SHMEM_Sync
    MPIDI_CH3I_SMP_init
    MPIDI_CH3_Init
    MPID_Init
    MPIR_Init_thread
    MPI_Init_thread

The other processes were all hung: 8 parents in MPI_Comm_spawn and 7
children in MPI_Init_thread.  In case it's helpful, here are the stack
traces for those remaining processes:

2x parent processes on worker10:
    mlx5_poll_cq_v1
    MPIDI_CH3I_MRAILI_Cq_poll_ib
    MPIDI_CH3I_read_progress
    MPIDI_CH3I_Progress
    MPIDI_Comm_accept
    MPID_Comm_accept
    MPIDI_Comm_spawn_multiple
    PMPI_Comm_spawn

1x parent on worker7, 2x parents on worker12, 1x parent on worker3:
    MPIDI_CH3I_MRAILI_Cq_poll_ib
    MPIDI_CH3I_read_progress
    MPIDI_CH3I_Progress
    MPIR_Bcast_binomial
    MPIR_Bcast_intra
    MPIR_Bcast_index_tuned_intra_MV2
    MPIR_Bcast_MV2
    MPIR_Bcast_intra
    MPIDI_Comm_accept
    MPID_Comm_accept
    MPIDI_Comm_spawn_multiple
    PMPI_Comm_spawn

1x parent on worker7, 1x parent on worker3:
    MPIDI_CH3I_SMP_pull_header
    MPIDI_CH3I_SMP_read_progress
    MPIDI_CH3I_Progress
    MPIR_Bcast_binomial
    MPIR_Bcast_intra
    MPIR_Bcast_index_tuned_intra_MV2
    MPIR_Bcast_MV2
    MPIR_Bcast_intra
    MPIDI_Comm_accept
    MPID_Comm_accept
    MPIDI_Comm_spawn_multiple
    PMPI_Comm_spawn

2x children on worker10, 1x child on worker7, 2x children on worker12, 1x
child on worker3:
    MPIDI_CH3I_MRAILI_Cq_poll_ib
    MPIDI_CH3I_read_progress
    MPIDI_CH3I_Progress
    MPIR_Allreduce_pt2pt_rd_MV2
    MPIR_Allreduce_index_tuned_intra_MV2
    MPIR_Allreduce_impl
    MPIR_Get_contextid_sparse_group
    MPIDI_Comm_connect
    MPID_Comm_connect
    MPID_Init
    MPIR_Init_thread
    PMPI_Init_thread

1x child on worker7:
    MPIDI_CH3I_SMP_write_progress
    MPIDI_CH3I_Progress
    MPIR_Allreduce_pt2pt_rd_MV2
    MPIR_Allreduce_index_tuned_intra_MV2
    MPIR_Allreduce_impl
    MPIR_Get_contextid_sparse_group
    MPIDI_Comm_connect
    MPID_Comm_connect
    MPID_Init
    MPIR_Init_thread
    PMPI_Init_thread
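
As a cross-check of the counts above, here is a small Python tally of the hung
processes per host, transcribed from the traces. It is only a sketch: the table
name HUNG and the helper per_host are made up for illustration, and the
conclusion rests on the assumption (not stated in the report) that each host
runs as many children as parents. Under that assumption, the host with a child
missing from the hung list would be where the crashed child ran.

```python
from collections import Counter

# (role, host, process count, innermost frame), transcribed from the
# stack traces of the hung processes listed above.
HUNG = [
    ("parent", "worker10", 2, "mlx5_poll_cq_v1"),
    ("parent", "worker7",  1, "MPIDI_CH3I_MRAILI_Cq_poll_ib"),
    ("parent", "worker12", 2, "MPIDI_CH3I_MRAILI_Cq_poll_ib"),
    ("parent", "worker3",  1, "MPIDI_CH3I_MRAILI_Cq_poll_ib"),
    ("parent", "worker7",  1, "MPIDI_CH3I_SMP_pull_header"),
    ("parent", "worker3",  1, "MPIDI_CH3I_SMP_pull_header"),
    ("child",  "worker10", 2, "MPIDI_CH3I_MRAILI_Cq_poll_ib"),
    ("child",  "worker7",  1, "MPIDI_CH3I_MRAILI_Cq_poll_ib"),
    ("child",  "worker12", 2, "MPIDI_CH3I_MRAILI_Cq_poll_ib"),
    ("child",  "worker3",  1, "MPIDI_CH3I_MRAILI_Cq_poll_ib"),
    ("child",  "worker7",  1, "MPIDI_CH3I_SMP_write_progress"),
]


def per_host(role):
    """Return a Counter mapping host name to hung-process count for a role."""
    counts = Counter()
    for entry_role, host, n, _frame in HUNG:
        if entry_role == role:
            counts[host] += n
    return counts


parents = per_host("parent")
children = per_host("child")

# Counter subtraction keeps only positive differences, so this lists hosts
# with fewer hung children than hung parents.
# ASSUMPTION: children are placed on the same hosts as their parents.
missing = parents - children

print("hung parents: ", sum(parents.values()), dict(parents))
print("hung children:", sum(children.values()), dict(children))
print("hosts short a hung child:", dict(missing))
```

The totals agree with the 8 hung parents and 7 hung children reported above.
If the placement assumption holds, the one host that comes up short is the
likely location of the child that took the bus error, which may help when
looking for a core file or node-specific shared-memory problems.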

.. Lana (lana.deere at gmail.com)