[mvapich-discuss] RESEND Re: mvapich2-2.3.2 crash when CPU affinity is enabled]

Subramoni, Hari subramoni.1 at osu.edu
Wed Mar 25 22:50:09 EDT 2020


Hi, Honggang.

We have some follow up questions and clarifications.

From the original backtrace you sent, the failure seemed to be in a hwloc function that is used to get the intra-node processor architecture. Given this, the following statement is a little confusing to us.

"It seems it is a Mellanox MLX5 specific issue. We had test mlx4 and OPA/HFI1. It works for me with mlx4 and OPA"

Could you please let us know what you mean by this? Did you happen to test MVAPICH2 2.3.3 GA with OPA/HFI1 NICs and Mellanox MLX4 NICs on the same compute node and have everything work fine? Did the failure happen only when you installed a Mellanox MLX5 NIC on that compute node?

Could you also send us the output of "lscpu" for the compute node in question?

Regards,
Hari.

-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Honggang LI
Sent: Wednesday, March 25, 2020 9:17 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] RESEND Re: mvapich2-2.3.2 crash when CPU affinity is enabled]

Resending this email, as our mail system notified me that the previous reply was lost.

On Wed, Mar 25, 2020 at 06:20:37AM +0000, Hashmi, Jahanzeb wrote:

>    Hi,
>    Sorry to know that you are facing issues with mvapich2-2.3.3. We have
>    tried to reproduce this at our end with the information you provided,
>    however we are unable to reproduce the issue. The build configuration and
>    the output is given below. Could you please let us know your system
>    configuration e.g., architecture, kernel version, hostfile format
>    (intra/inter), compiler version?

[root@rdma-qe-06 ~]$ uname -r
4.18.0-189.el8.x86_64

[root@rdma-qe-06 ~]$ cat hfile_one_core
172.31.0.6
172.31.0.7

[root@rdma-qe-06 ~]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --disable-libmpx --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)

I have also attached the "config.log" of mvapich2-2.3.3. It seems to be a Mellanox MLX5-specific issue. We have tested mlx4 and OPA/HFI1. It works for me with mlx4 and OPA.

http://people.redhat.com/honli/mvapich2/config.log

>    [hashmij@haswell3 mvapich2-2.3.3]$ ./install/bin/mpiname -a
>    MVAPICH2 2.3.3 Thu January 09 22:00:00 EST 2019 ch3:mrail
>    Compilation
>    CC: gcc    -DNDEBUG -DNVALGRIND -g -O2
>    CXX: g++   -DNDEBUG -DNVALGRIND -g -O2
>    F77: gfortran -L/lib -L/lib   -g -O2
>    FC: gfortran   -g -O2
>    Configuration
>    --prefix=/home/hashmij/release-testing/mvapich2-2.3.3/install
>    --enable-error-messages=all --enable-g=dbg,debug
>    [hashmij@haswell3 mvapich2-2.3.3]$ ./install/bin/mpirun -genv
>    MV2_ENABLE_AFFINITY 1 -genv MV2_DEBUG_SHOW_BACKTRACE 1 -hostfile ~/hosts
>    -np 2 ./install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
>    # OSU MPI Bandwidth Test v5.6.2
>    # Size      Bandwidth (MB/s)
>    1                       7.33
>    2                      14.43
>    4                      30.03
>    8                      60.00
>    16                    119.77
>    32                    237.23
>    64                    469.27
>    128                   861.36
>    256                  1616.93
>    512                  2804.30
>    1024                 3900.08
>    2048                 5328.12
>    4096                 6738.44
>    8192                 8009.26
>    16384                8477.34
>    32768               12230.39
>    65536               13757.87
>    131072              13768.37
>    262144              12064.52
>    524288              11696.69
>    1048576             11877.53
>    2097152             11720.65
>    4194304             10932.82

The bandwidth is high, so it seems you were not running the test over mlx5.
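
For what it is worth, here is roughly how I check which HCA a run actually uses on our nodes. This is only a sketch: it assumes rdma-core's ibv_devinfo is installed, and that MV2_IBA_HCA / MV2_SHOW_ENV_INFO are the MVAPICH2 runtime parameters for selecting and reporting the HCA.

ibv_devinfo -l                                  # list the RDMA devices on the node, e.g. mlx5_0 or mlx4_0
mpirun -genv MV2_IBA_HCA mlx5_0 -genv MV2_SHOW_ENV_INFO 1 \
       -hostfile hfile_one_core -np 2 ./osu_bw  # pin the run to the mlx5 device and print the runtime settings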

thanks


_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


