[mvapich-discuss] Errors with 2.3b and more than 64 processes

Vladimir Florinski vaf0001 at uah.edu
Thu Sep 21 11:59:04 EDT 2017


We are having problems with MVAPICH2 version 2.3b. Applications work as
expected when run on 64 or fewer cores, but crash when run on more. We
tested on 16-core, 12-core, and 8-core nodes with different InfiniBand
interconnects (DDR, QDR, FDR) and saw no difference, which suggests that
the MPI library is the likely culprit. Here is some typical error output:

    Stack trace of thread 3776:
    #0  0x00007f6f5d6ce61f MPIDI_CH3_iStartMsgv (libmpi.so.12)
    #1  0x00007f6f5d6bd15f MPIDI_CH3_EagerContigSend (libmpi.so.12)
    #2  0x00007f6f5d6c5273 MPID_Send (libmpi.so.12)
    #3  0x00007f6f5d6645d7 MPIC_Send (libmpi.so.12)
    #4  0x00007f6f5d4263b7 MPIR_Bcast_binomial (libmpi.so.12)
    #5  0x00007f6f5d427cc5 MPIR_Bcast_intra (libmpi.so.12)
    #6  0x00007f6f5d48102a MPIR_Bcast_index_tuned_intra_MV2 (libmpi.so.12)
    #7  0x00007f6f5d47ec34 MPIR_Bcast_MV2 (libmpi.so.12)
    #8  0x00007f6f5d427fb5 MPIR_Bcast_intra (libmpi.so.12)
    #9  0x00007f6f5d48102a MPIR_Bcast_index_tuned_intra_MV2 (libmpi.so.12)
    #10 0x00007f6f5d47ec34 MPIR_Bcast_MV2 (libmpi.so.12)
    #11 0x00007f6f5d4289bb MPIR_Bcast_impl (libmpi.so.12)
    #12 0x00007f6f5d4291b9 PMPI_Bcast (libmpi.so.12)
    #13 0x00007f6f5da8bb95 pmpi_bcast_ (libmpifort.so.12)
    #14 0x0000556d8ca67814 __level2_subroutines_MOD_init (neutrals_PUIs)
    #15 0x0000556d8ca6a2c9 neutrals (neutrals_PUIs)
    #16 0x0000556d8ca448c9 main (neutrals_PUIs)
    #17 0x00007f6f5c4c450a __libc_start_main (libc.so.6)
    #18 0x0000556d8ca4491a _start (neutrals_PUIs)

    Stack trace of thread 3823:
    #0  0x00007f6f5c5a41cd __read (libc.so.6)
    #1  0x00007f6f5b84cdef ibv_get_async_event (libibverbs.so.1)
    #2  0x00007f6f5d6fdf8b async_thread (libmpi.so.12)
    #3  0x00007f6f5b00336d start_thread (libpthread.so.0)
    #4  0x00007f6f5c5b4bbf __clone (libc.so.6)

    Stack trace of thread 3824:
    #0  0x00007f6f5c5a41cd __read (libc.so.6)
    #1  0x00007f6f5b84f95c ibv_get_cq_event (libibverbs.so.1)
    #2  0x00007f6f5d70bcc9 cm_completion_handler (libmpi.so.12)
    #3  0x00007f6f5b00336d start_thread (libpthread.so.0)
    #4  0x00007f6f5c5b4bbf __clone (libc.so.6)

    Stack trace of thread 3826:
    #0  0x00007f6f5b00990b pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
    #1  0x00007f6f5d70ba67 cm_timeout_handler (libmpi.so.12)
    #2  0x00007f6f5b00336d start_thread (libpthread.so.0)
    #3  0x00007f6f5c5b4bbf __clone (libc.so.6)

The errors seem to appear whenever the application makes a collective
communication call (MPI_Bcast in the trace above). The library was
configured with the following options:

--with-device=ch3:mrail --with-rdma=gen2 --with-pmi=pmi2 --with-pm=slurm
--enable-g=dbg --enable-debuginfo

We would appreciate help troubleshooting this issue, and we can provide
guest access for debugging if necessary.

Thanks,


-- 
Vladimir Florinski