[mvapich-discuss] RESEND Re: mvapich2-2.3.2 crash when CPU affinity is enabled]
Subramoni, Hari
subramoni.1 at osu.edu
Wed Mar 25 22:50:09 EDT 2020
Hi, Honggang.
We have some follow-up questions and clarifications.
From the original backtrace you sent, the failure seemed to be in an hwloc function that is used to get the intra-node processor architecture. Given this, the following statement is a little confusing for us.
"It seems to be a Mellanox MLX5-specific issue. We tested mlx4 and OPA/HFI1. It works for me with mlx4 and OPA."
Could you please let us know what you mean by this? Did you happen to test MVAPICH2 2.3.3GA with OPA/HFI1 NICs and Mellanox MLX4 NICs on the same compute node and have everything work fine? Did the failure happen only when you installed a Mellanox MLX5 NIC on the same compute node?
Could you also send us the output of "lscpu" for the compute node in question?
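For reference, the topology details in question could be gathered on the node with something like the sketch below (assuming `lscpu` from util-linux is present; `hwloc-ls` is optional and only run if installed):

```shell
# Dump the CPU topology of the compute node under test.
# lscpu ships with util-linux; hwloc-ls ships with hwloc and is
# guarded here in case it is not installed on the node.
lscpu
if command -v hwloc-ls >/dev/null 2>&1; then
    hwloc-ls --no-io
else
    echo "hwloc-ls not installed"
fi
```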
Regards,
Hari.
-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Honggang LI
Sent: Wednesday, March 25, 2020 9:17 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] RESEND Re: mvapich2-2.3.2 crash when CPU affinity is enabled]
Resending this email, as our mail system notified me that the previous reply was lost.
On Wed, Mar 25, 2020 at 06:20:37AM +0000, Hashmi, Jahanzeb wrote:
> Hi,
> Sorry to know that you are facing issues with mvapich2-2.3.3. We have
> tried to reproduce this at our end with the information you provided,
> however we are unable to reproduce the issue. The build configuration and
> the output is given below. Could you please let us know your system
> configuration e.g., architecture, kernel version, hostfile format
> (intra/inter), compiler version?
[root@rdma-qe-06 ~]$ uname -r
4.18.0-189.el8.x86_64
[root@rdma-qe-06 ~]$ cat hfile_one_core
172.31.0.6
172.31.0.7
[root@rdma-qe-06 ~]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --disable-libmpx --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)
I also attached the "config.log" of mvapich2-2.3.3. It seems to be a Mellanox MLX5-specific issue. We tested mlx4 and OPA/HFI1. It works for me with mlx4 and OPA.
http://people.redhat.com/honli/mvapich2/config.log
> [hashmij@haswell3 mvapich2-2.3.3]$ ./install/bin/mpiname -a
> MVAPICH2 2.3.3 Thu January 09 22:00:00 EST 2019 ch3:mrail
> Compilation
> CC: gcc -DNDEBUG -DNVALGRIND -g -O2
> CXX: g++ -DNDEBUG -DNVALGRIND -g -O2
> F77: gfortran -L/lib -L/lib -g -O2
> FC: gfortran -g -O2
> Configuration
> --prefix=/home/hashmij/release-testing/mvapich2-2.3.3/install
> --enable-error-messages=all --enable-g=dbg,debug
> [hashmij@haswell3 mvapich2-2.3.3]$ ./install/bin/mpirun -genv
> MV2_ENABLE_AFFINITY 1 -genv MV2_DEBUG_SHOW_BACKTRACE 1 -hostfile ~/hosts
> -np 2 ./install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
> # OSU MPI Bandwidth Test v5.6.2
> # Size Bandwidth (MB/s)
> 1 7.33
> 2 14.43
> 4 30.03
> 8 60.00
> 16 119.77
> 32 237.23
> 64 469.27
> 128 861.36
> 256 1616.93
> 512 2804.30
> 1024 3900.08
> 2048 5328.12
> 4096 6738.44
> 8192 8009.26
> 16384 8477.34
> 32768 12230.39
> 65536 13757.87
> 131072 13768.37
> 262144 12064.52
> 524288 11696.69
> 1048576 11877.53
> 2097152 11720.65
> 4194304 10932.82
The bandwidth is high. It seems you were not running the test with mlx5.
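One way to confirm which RDMA device a node actually exposes is to list the verbs devices; this sketch assumes the rdma-core tools (ibv_devinfo) are installed and falls back to a message if not:

```shell
# List the RDMA devices visible to libibverbs on this node
# (e.g. mlx4_0, mlx5_0, hfi1_0). ibv_devinfo comes from rdma-core;
# fall back gracefully if it is not installed.
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo -l
else
    echo "ibv_devinfo not installed"
fi
```

If multiple HCAs are present on the node, MVAPICH2's MV2_IBA_HCA runtime parameter can pin a job to a specific device (e.g. `-genv MV2_IBA_HCA mlx5_0`) so that the comparison runs exercise the intended NIC.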
thanks
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss