[mvapich-discuss] Errors with 2.3b and more than 64 processes

Hari Subramoni subramoni.1 at osu.edu
Thu Sep 21 15:15:56 EDT 2017


Hi,

Sorry to hear that you're facing issues with MVAPICH2. Could you please let
us know what application you are using here? Do basic OSU Microbenchmark
tests like osu_bcast work on the same system/process count combinations?
Could you please send us the output of mpiname -a?
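
For example, something like the following would exercise broadcast at the
failing scale (just a sketch, assuming a SLURM allocation and that the bundled
OSU micro-benchmarks were installed under your MVAPICH2 install prefix; the
MVAPICH2_INSTALL variable is a placeholder, and the task count should match
your failing runs):

    # MVAPICH2_INSTALL is a placeholder for your MVAPICH2 install prefix
    srun -n 128 $MVAPICH2_INSTALL/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast
    mpiname -a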

Can you please try setting "MV2_ON_DEMAND_THRESHOLD=<number of
processes+1>" and see if the test passes? The following section of the
MVAPICH2 userguide has more information on this.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html#x1-21800011.43
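
For example, for a 128-process run it could be set like this (a sketch for a
SLURM launch; srun propagates the environment to the MPI processes by default,
and 129 here assumes 128 processes, so substitute your actual process count
and application binary):

    # threshold = number of processes + 1
    MV2_ON_DEMAND_THRESHOLD=129 srun -n 128 ./neutrals_PUIs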

If the above does not help, can you please try setting
"MV2_USE_SHMEM_COLL=0" and see if the test passes? The following section of
the MVAPICH2 userguide has more information on this.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html#x1-27000011.95
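
For example, in a job script (again only a sketch; adjust the task count and
binary name to your job):

    export MV2_USE_SHMEM_COLL=0
    srun -n 128 ./neutrals_PUIs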

Getting access to the system always helps us debug things faster.

Thx,
Hari.

On Thu, Sep 21, 2017 at 11:59 AM, Vladimir Florinski <vaf0001 at uah.edu>
wrote:

> We are having problems with MVAPICH2 version 2.3b. Applications work as
> expected when run on 64 or fewer cores, but crash with more than that. This
> was tested on 16-core, 12-core, and 8-core nodes with different InfiniBand
> interconnects (DDR, QDR, and FDR), and the behavior is the same in every
> case, which implies that the MPI library is the likely culprit. Here is
> some typical error output:
>
>     Stack trace of thread 3776:
>      #0  0x00007f6f5d6ce61f MPIDI_CH3_iStartMsgv (libmpi.so.12)
>      #1  0x00007f6f5d6bd15f MPIDI_CH3_EagerContigSend (libmpi.so.12)
>      #2  0x00007f6f5d6c5273 MPID_Send (libmpi.so.12)
>      #3  0x00007f6f5d6645d7 MPIC_Send (libmpi.so.12)
>      #4  0x00007f6f5d4263b7 MPIR_Bcast_binomial (libmpi.so.12)
>      #5  0x00007f6f5d427cc5 MPIR_Bcast_intra (libmpi.so.12)
>      #6  0x00007f6f5d48102a MPIR_Bcast_index_tuned_intra_MV2 (libmpi.so.12)
>      #7  0x00007f6f5d47ec34 MPIR_Bcast_MV2 (libmpi.so.12)
>      #8  0x00007f6f5d427fb5 MPIR_Bcast_intra (libmpi.so.12)
>      #9  0x00007f6f5d48102a MPIR_Bcast_index_tuned_intra_MV2 (libmpi.so.12)
>      #10 0x00007f6f5d47ec34 MPIR_Bcast_MV2 (libmpi.so.12)
>      #11 0x00007f6f5d4289bb MPIR_Bcast_impl (libmpi.so.12)
>      #12 0x00007f6f5d4291b9 PMPI_Bcast (libmpi.so.12)
>      #13 0x00007f6f5da8bb95 pmpi_bcast_ (libmpifort.so.12)
>      #14 0x0000556d8ca67814 __level2_subroutines_MOD_init (neutrals_PUIs)
>      #15 0x0000556d8ca6a2c9 neutrals (neutrals_PUIs)
>      #16 0x0000556d8ca448c9 main (neutrals_PUIs)
>      #17 0x00007f6f5c4c450a __libc_start_main (libc.so.6)
>      #18 0x0000556d8ca4491a _start (neutrals_PUIs)
>
>     Stack trace of thread 3823:
>      #0  0x00007f6f5c5a41cd __read (libc.so.6)
>      #1  0x00007f6f5b84cdef ibv_get_async_event (libibverbs.so.1)
>      #2  0x00007f6f5d6fdf8b async_thread (libmpi.so.12)
>      #3  0x00007f6f5b00336d start_thread (libpthread.so.0)
>      #4  0x00007f6f5c5b4bbf __clone (libc.so.6)
>
>     Stack trace of thread 3824:
>      #0  0x00007f6f5c5a41cd __read (libc.so.6)
>      #1  0x00007f6f5b84f95c ibv_get_cq_event (libibverbs.so.1)
>      #2  0x00007f6f5d70bcc9 cm_completion_handler (libmpi.so.12)
>      #3  0x00007f6f5b00336d start_thread (libpthread.so.0)
>      #4  0x00007f6f5c5b4bbf __clone (libc.so.6)
>
>     Stack trace of thread 3826:
>      #0  0x00007f6f5b00990b pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
>      #1  0x00007f6f5d70ba67 cm_timeout_handler (libmpi.so.12)
>      #2  0x00007f6f5b00336d start_thread (libpthread.so.0)
>      #3  0x00007f6f5c5b4bbf __clone (libc.so.6)
>
> The errors seem to appear whenever there is a collective communication
> call. The library was built with the following configure options:
>
> --with-device=ch3:mrail --with-rdma=gen2 --with-pmi=pmi2 --with-pm=slurm
> --enable-g=dbg --enable-debuginfo
>
> We need help troubleshooting the issue. We can provide guest access to
> debug if necessary.
>
> Thanks,
>
>
> --
> Vladimir Florinski
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>