[mvapich-discuss] Errors with 2.3b and more than 64 processes

Vladimir Florinski vaf0001 at uah.edu
Thu Sep 21 18:34:20 EDT 2017


Here are the results of testing:

First, the OSU benchmarks show the same behavior: I just tested with
osu_bcast, and it runs on up to 64 processes but segfaults beyond that.
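
For reference, the benchmark runs were of roughly this form (a sketch only;
the launcher arguments and the rank counts shown are illustrative):

  srun -n 64 ./osu_bcast    # completes
  srun -n 65 ./osu_bcast    # segfaults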

Output of mpiname:

MVAPICH2 2.3b Thu Aug 10 22:00:00 EST 2017 ch3:mrail

Compilation
CC: gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64
-mtune=generic   -DNDEBUG -DNVALGRIND -g -O2
CXX: g++ -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64
-mtune=generic  -DNDEBUG -DNVALGRIND -g -O2
F77: gfortran -O2 -g -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
--param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
-I/usr/lib64/gfortran/modules  -g -O2
FC: gfortran -O2 -g -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
--param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
-I/usr/lib64/gfortran/modules  -g -O2

Configuration
--build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu
--program-prefix= --disable-dependency-tracking --prefix=/usr
--exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
--datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64
--libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib
--mandir=/usr/share/man --infodir=/usr/share/info --with-device=ch3:mrail
--with-rdma=gen2 --with-pmi=pmi2 --with-pm=slurm --enable-g=dbg
--enable-debuginfo build_alias=x86_64-redhat-linux-gnu
host_alias=x86_64-redhat-linux-gnu CFLAGS=-O2 -g -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
LDFLAGS=-Wl,-z,relro -specs=/usr/lib/rpm/redhat/redhat-hardened-ld
CXXFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64
-mtune=generic FCFLAGS=-O2 -g -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
--param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
-I/usr/lib64/gfortran/modules FFLAGS=-O2 -g -pipe -Wall
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic
-I/usr/lib64/gfortran/modules --no-create --no-recursion


Setting MV2_ON_DEMAND_THRESHOLD=81 allowed the code to complete without
error. I have not tried disabling shared-memory collectives, since the
threshold fix worked. I understand that setting the threshold to a very high
value (larger than the cluster size) would work around our problem. Is there
a performance penalty for doing so?
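
For the record, the variable was simply exported before the SLURM launch,
along the lines of the sketch below (the rank count is illustrative; 81
follows the "number of processes + 1" rule from your suggestion):

  export MV2_ON_DEMAND_THRESHOLD=81
  srun -n 80 ./neutrals_PUIs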

Thanks,



On Thu, Sep 21, 2017 at 2:15 PM, Hari Subramoni <subramoni.1 at osu.edu> wrote:

> Hi,
>
> Sorry to hear that you're facing issues with MVAPICH2. Could you please
> let us know what application you are using here? Do basic OSU
> Microbenchmark tests like osu_bcast work on the same system/process count
> combinations? Could you please send us the output of mpiname -a?
>
> Can you please try setting "MV2_ON_DEMAND_THRESHOLD=<number of
> processes+1>" and see if the test passes? The following section of the
> MVAPICH2 userguide has more information on this.
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html#x1-21800011.43
>
> If the above does not help, can you please try setting
> "MV2_USE_SHMEM_COLL=0" and see if the test passes? The following section
> of the MVAPICH2 userguide has more information on this.
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html#x1-27000011.95
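>
> As an illustration only (a sketch; substitute your actual rank count and
> binary, and assuming SLURM's default of exporting the full environment to
> the tasks), either setting can be passed through the environment at launch:
>
>   MV2_ON_DEMAND_THRESHOLD=<nprocs+1> srun -n <nprocs> ./osu_bcast
>   MV2_USE_SHMEM_COLL=0 srun -n <nprocs> ./osu_bcast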
>
> Getting access to the system always helps to debug things faster.
>
> Thx,
> Hari.
>
> On Thu, Sep 21, 2017 at 11:59 AM, Vladimir Florinski <vaf0001 at uah.edu>
> wrote:
>
>> We are having problems with mvapich2 version 2.3b. Applications work as
>> expected when run on 64 or fewer cores, but crash with more than that. This
>> was tested on 16-core, 12-core, and 8-core nodes with different InfiniBand
>> fabrics (DDR, QDR, FDR), and there is no difference, which suggests that
>> the MPI library is the likely culprit. Here is some typical error output:
>>
>> Stack trace of thread 3776:
>>   #0  0x00007f6f5d6ce61f MPIDI_CH3_iStartMsgv (libmpi.so.12)
>>   #1  0x00007f6f5d6bd15f MPIDI_CH3_EagerContigSend (libmpi.so.12)
>>   #2  0x00007f6f5d6c5273 MPID_Send (libmpi.so.12)
>>   #3  0x00007f6f5d6645d7 MPIC_Send (libmpi.so.12)
>>   #4  0x00007f6f5d4263b7 MPIR_Bcast_binomial (libmpi.so.12)
>>   #5  0x00007f6f5d427cc5 MPIR_Bcast_intra (libmpi.so.12)
>>   #6  0x00007f6f5d48102a MPIR_Bcast_index_tuned_intra_MV2 (libmpi.so.12)
>>   #7  0x00007f6f5d47ec34 MPIR_Bcast_MV2 (libmpi.so.12)
>>   #8  0x00007f6f5d427fb5 MPIR_Bcast_intra (libmpi.so.12)
>>   #9  0x00007f6f5d48102a MPIR_Bcast_index_tuned_intra_MV2 (libmpi.so.12)
>>   #10 0x00007f6f5d47ec34 MPIR_Bcast_MV2 (libmpi.so.12)
>>   #11 0x00007f6f5d4289bb MPIR_Bcast_impl (libmpi.so.12)
>>   #12 0x00007f6f5d4291b9 PMPI_Bcast (libmpi.so.12)
>>   #13 0x00007f6f5da8bb95 pmpi_bcast_ (libmpifort.so.12)
>>   #14 0x0000556d8ca67814 __level2_subroutines_MOD_init (neutrals_PUIs)
>>   #15 0x0000556d8ca6a2c9 neutrals (neutrals_PUIs)
>>   #16 0x0000556d8ca448c9 main (neutrals_PUIs)
>>   #17 0x00007f6f5c4c450a __libc_start_main (libc.so.6)
>>   #18 0x0000556d8ca4491a _start (neutrals_PUIs)
>>
>> Stack trace of thread 3823:
>>   #0  0x00007f6f5c5a41cd __read (libc.so.6)
>>   #1  0x00007f6f5b84cdef ibv_get_async_event (libibverbs.so.1)
>>   #2  0x00007f6f5d6fdf8b async_thread (libmpi.so.12)
>>   #3  0x00007f6f5b00336d start_thread (libpthread.so.0)
>>   #4  0x00007f6f5c5b4bbf __clone (libc.so.6)
>>
>> Stack trace of thread 3824:
>>   #0  0x00007f6f5c5a41cd __read (libc.so.6)
>>   #1  0x00007f6f5b84f95c ibv_get_cq_event (libibverbs.so.1)
>>   #2  0x00007f6f5d70bcc9 cm_completion_handler (libmpi.so.12)
>>   #3  0x00007f6f5b00336d start_thread (libpthread.so.0)
>>   #4  0x00007f6f5c5b4bbf __clone (libc.so.6)
>>
>> Stack trace of thread 3826:
>>   #0  0x00007f6f5b00990b pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
>>   #1  0x00007f6f5d70ba67 cm_timeout_handler (libmpi.so.12)
>>   #2  0x00007f6f5b00336d start_thread (libpthread.so.0)
>>   #3  0x00007f6f5c5b4bbf __clone (libc.so.6)
>>
>> The errors seem to appear whenever there is a collective communication
>> call. The library was built with the following options:
>>
>> --with-device=ch3:mrail --with-rdma=gen2 --with-pmi=pmi2 --with-pm=slurm
>> --enable-g=dbg --enable-debuginfo
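>>
>> For completeness, that corresponds to a configure invocation along the
>> lines of (distribution-specific prefix and compiler flags omitted here):
>>
>>   ./configure --with-device=ch3:mrail --with-rdma=gen2 --with-pmi=pmi2 \
>>               --with-pm=slurm --enable-g=dbg --enable-debuginfo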
>>
>> We need help troubleshooting the issue. We can provide guest access to
>> debug if necessary.
>>
>> Thanks,
>>
>>
>> --
>> Vladimir Florinski
>>


-- 
Vladimir Florinski