From john at flexcompute.com  Tue Jan 11 14:48:30 2022
From: john at flexcompute.com (John Moore)
Date: Tue, 11 Jan 2022 14:48:30 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
Message-ID:

Hello,

We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault.

Below is the output from osu_bibw using MVAPICH2:

MVAPICH2-2.3.6 Parameters
---------------------------------------------------------------------
PROCESSOR ARCH NAME      : MV2_ARCH_AMD_EPYC_7401_48
PROCESSOR FAMILY NAME    : MV2_CPU_FAMILY_AMD
PROCESSOR MODEL NUMBER   : 1
HCA NAME                 : MV2_HCA_MLX_CX_HDR
HETEROGENEOUS HCA        : NO
MV2_EAGERSIZE_1SC        : 0
MV2_SMP_EAGERSIZE        : 16385
MV2_SMP_QUEUE_LENGTH     : 65536
MV2_SMP_NUM_SEND_BUFFER  : 16
MV2_SMP_BATCH_SIZE       : 8
Tuning Table:            : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR
---------------------------------------------------------------------
# OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.07
2                       0.15
4                       0.29
8                       0.57
16                      1.12
32                      2.30
64                      4.75
128                     9.41
256                    18.44
512                    37.22
1024                   74.82
2048                  144.70
4096                  289.96
8192                  577.33
[cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 471850 RUNNING AT cell3
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

And this is with OpenMPI:

# OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.43
2                       0.83
4                       1.68
8                       3.37
16                      6.72
32                     13.42
64                     27.02
128                    53.78
256                   107.88
512                   219.45
1024                  437.81
2048                  875.12
4096                 1747.23
8192                 3528.97
16384                7015.15
32768               13973.59
65536               27702.68
131072              51877.67
262144              94556.99
524288             157755.18
1048576            236772.67
2097152            333635.13
4194304            408865.93

Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated.

Thanks,
John

From panda at cse.ohio-state.edu  Tue Jan 11 14:55:28 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Tue, 11 Jan 2022 19:55:28 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi,

Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster.
Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. 
> > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. 
Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 
4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. 
Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of log_num_mtt for OFED. The documented fix involves changing that parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* files under /etc/modprobe.d, only mlx5_*. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output of ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
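The "Cannot register vbuf region ... Cannot allocate memory (12)" failure above is a memory-registration error, and the ulimit output shows max locked memory capped at 65536 kbytes on both nodes, which is a common cause: InfiniBand buffer registration generally needs a much higher, ideally unlimited, memlock limit. Note also that log_num_mtt is a parameter of the older mlx4_core driver; an mlx5-based ConnectX adapter typically has no such module parameter to tune. A minimal sketch of raising the limit, assuming a standard pam_limits setup (the choice of "unlimited" and the exact file paths are illustrative, not taken from the thread):

# check the limit in the shell that launches mpirun, on every node
ulimit -l

# make it persistent: add to /etc/security/limits.conf (or a file under /etc/security/limits.d/)
*  soft  memlock  unlimited
*  hard  memlock  unlimited

# log in again on both nodes (non-interactive ssh sessions must pick the new limit up too), then retry
ulimit -l
MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw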
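On the thread's original question of building GPU support from source: a plain MVAPICH2 build configured with --enable-cuda provides CUDA-aware MPI, but, as the replies above indicate, the GPUDirect RDMA optimizations are only distributed in the MVAPICH2-GDR binaries, which is why the maintainers recommend the RPMs. For reference, a minimal sketch of the source route (the install prefix and the D D buffer arguments are illustrative; see the MVAPICH2 user guide for the full set of CUDA-related configure options):

./configure --prefix=$HOME/mvapich2-2.3.6-cuda --enable-cuda
make -j && make install

# CUDA support must also be enabled explicitly at run time
export MV2_USE_CUDA=1
mpirun -np 2 -hostfile hostfile ./osu_bibw D D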
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 14:26:06 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi John,

You should be able to use the RPMs on Ubuntu by converting them with alien.
Regarding the CUDA and compiler versioning, you will want to make sure CUDA
is an exact match, but the compiler should only need to be the same major
version. You will also want to make sure that you match the mofed major
version as well, though we recommend matching the exact version if possible.
Please take a look at the download page and see if any of the RPMs there
match your needs. Otherwise, we would be happy to generate a custom RPM based
on your system specifications.

Thanks,
Nat
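For reference, a minimal sketch of the alien-based conversion Nat describes,
assuming the GDR RPM has already been downloaded from the MVAPICH2-GDR page
and that root access is available (the exact name of the .deb that alien
generates may differ):

  # install alien, convert the el8 RPM to a .deb, and install it
  sudo apt-get install -y alien
  sudo alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
  sudo dpkg -i mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1*_amd64.deb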
From john at flexcompute.com Wed Jan 12 11:14:15 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:14:15 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi Nat,

We have been struggling to get the RPM to work for us -- we've been working
on it for about a week. We are using this RPM:
http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm

If you could build us a custom RPM for our system, that would be very helpful.

We're running Ubuntu 20.04, kernel 5.4.0-92-generic.
GCC version: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CUDA version: CUDA 11.4
CUDA driver: 470.82.01

Please let me know if there is any other information that you need.

Thanks,
John
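For reference, the system details listed above map onto a few standard query
commands (a sketch; output formats differ slightly across distro, driver, and
toolkit releases):

  uname -r                          # kernel, e.g. 5.4.0-92-generic
  gcc --version | head -n1          # gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
  nvcc --version | tail -n1         # CUDA toolkit release, e.g. 11.4
  nvidia-smi --query-gpu=driver_version --format=csv,noheader   # 470.82.01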
From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:16:45 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi John,

Can you tell us the OFED version on your system?

Thanks,
Nat
From john at flexcompute.com Wed Jan 12 11:19:08 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:19:08 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi Nat,

We are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64

Thanks,
John
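For reference, the MOFED version string John quotes can be read directly off
a node with the ofed_info utility that ships with MLNX_OFED (a small sketch):

  ofed_info -s    # prints e.g. MLNX_OFED_LINUX-5.5-1.0.3.2: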
From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

John,

Thanks, we will get started on generating this RPM shortly.

Nat
From john at flexcompute.com Wed Jan 12 11:21:05 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:21:05 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Great, thank you.
From john at flexcompute.com Wed Jan 12 15:46:07 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 15:46:07 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hello,

While we wait for the RPM, we are trying to get regular MVAPICH2 to work
across two of our nodes. We compiled version 2.3.6 from source. We can run
the osu_bibw test locally, within a node, without errors. However, when we
try to run across two nodes, we get the following error:

john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of
log_num_mtt for OFED. We've found documentation for how to change this, and
it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However,
we do not have any mlx4_* files under /etc/modprobe.d, only mlx5_*. We are
using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output of ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
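The documentation John refers to ties this error to memory-registration
limits. With an mlx5-based stack (ConnectX HDR adapters, MLNX_OFED 5.x) the
mlx4_core log_num_mtt parameter does not exist, which matches the missing
mlx4_* files; the more likely culprit in the ulimit output above is the
locked-memory cap of 65536 kB. A minimal sketch of checking and raising it,
assuming root access and that ranks start from a normal login shell (the
limits.d file name is arbitrary):

  # check the current limit on every node; it should be 'unlimited' for RDMA
  ulimit -l

  # raise it persistently via pam_limits, then log in again
  printf '* soft memlock unlimited\n* hard memlock unlimited\n' | \
      sudo tee /etc/security/limits.d/99-rdma.conf

  # jobs launched through a resource manager or systemd service additionally
  # need the equivalent setting there (e.g. LimitMEMLOCK=infinity)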
From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Fri, 21 Jan 2022 14:27:51 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To:
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de>
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>

Hello Nat,

We are building a portable container for different cloud providers. While
everything works on our own system, testing on the AWS cloud shows a problem.
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the
communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

  singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

With ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT
as a portable container we would lose IB on other cloud systems. And, as
mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp
does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The attached document contains the script we use to install mvapich2 in a
docker container with Debian:bullseye 10 or 11. Am I missing any flags or
prerequisites?

Cheers,
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, 1 December 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6
for your use case. mv2-virt is very old and has not been updated for some
time. Using a Debian-based container should be fine. Everything will compile
normally from within the container, just ensure that you have GNU Autotools
installed.

Unfortunately, at this time we do not have a configure-time flag for
disabling the OMB suite. We will look into adding support for this in our
upcoming 2.3.7 release. For the time being we can suggest a workaround. When
you run "make install", though, you should see all of the OMB binaries in a
directory called "libexec" in your mvapich2 installation directory. If you
are installing it in the default location (/usr) you will see a directory
/usr/libexec/osu-micro-benchmarks; in my experience it is typically around
3MB. You can delete this directory to remove the installed binaries from your
system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running "./configure --help" will list
all of the configuration options available to you. You are correct that
--disable-fortran is one way to reduce the installation size. Likewise, you
can use --disable-cxx to disable C++ bindings if you only wish to have C
libraries installed. However, there is no option for "disable all", as many
of the enable/disable flags are used to determine feature sets within the
library and not actual binaries. Please ensure that you set --enable-g=none
(which should be the default) to remove all debugging symbols and reduce
size. Other than that, just avoiding enabling additional features should
yield the smallest libraries possible.
Please let us know if you have any trouble with the installation or if the
compiled libraries are still too large and we can see what can be done.

Thanks,
Nat

________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark

Hi MVAPICH-Team,

We want to use containers for a cloud project. To minimize the container size
and what the container can do, we would like to disable, for some containers,
the benchmark tests and the ability to compile. Could you give us a
recommendation on how to do so? We would like to use a Debian version
(instead of CentOS) in the container. As far as I understand, with
<--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran>
for Fortran. Is there an option to disable all? What is the option to disable
the osu-benchmarks? Which package would you recommend for any cloud?

Sincerely,
Joachim Tscheuschner

-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
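Putting Nat's workaround together, a minimal sketch of a slimmed-down
container build; the install prefix is an assumption, and deleting the
libexec directory only removes the installed OSU benchmarks, not any MPI
functionality:

  ./configure --prefix=/usr/local/mvapich2 \
              --enable-g=none --disable-fortran --disable-cxx
  make -j"$(nproc)"
  make install
  # drop the installed OSU micro-benchmarks to shrink the image
  rm -rf /usr/local/mvapich2/libexec/osu-micro-benchmarks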
From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 21 Jan 2022 14:44:53 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
Message-ID:

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a
separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is
optimized for the AWS EFA adapters. Please use this version (you can download
it from the MVAPICH2 download site) and follow the steps mentioned in the
associated user guide. Let us know if you encounter any issues with this
version.

Thanks,
DK
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat
________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark

Hi MVAPICH-Team, we want to use containers for a cloud project. To minimize the container size, we would like to disable, for some containers, the benchmark tests and the compiling ability. Could you give us a recommendation on how to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, the compiling ability can be reduced with <--disable-x>, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner
_______________________________________________
Mvapich-discuss mailing list
Mvapich-discuss at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich-discuss

From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi, I am aware that there is a possibility to use the AWS version; however, we just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of TCP for AWS would be enough (at the moment). However, the question (compare 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note nemesis:ib is deprecated):

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' .....

remains. Am I missing any flags or prerequisites? Cheers Joachim
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fails to complete when servers have extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team, I've been debugging an issue with our system that came up after installing extra ConnectX-5 cards in some of the servers (12 out of 48) in our system. When adding a single rank from a server hosting an extra CX-5 we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the 'all_cards.cfg' host file, I can use all hosts and the maximum number of ranks. The CX-5s were added a month ago and I initially suspected I made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get a different error (which is strange). The problem started showing up when the cables were added between the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing 'MV2_HOMOGENEOUS_CLUSTER=1' does not make any difference, and explicitly specifying 'MV2_IBA_HCA=mlx5_0' doesn't help either. Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here.
Unfortunately, I had to disable all the cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.

/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100
ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1

Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network
-------------- next part --------------
An HTML attachment was scrubbed... URL:
-------------- next part --------------
A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL:

From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022
From: daniel.pou at hpe.com (Pou, Dan)
Date: Fri, 28 Jan 2022 18:20:26 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4? Thanks, -Dan

From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022
From: subramoni.1 at osu.edu (Subramoni, Hari)
Date: Fri, 28 Jan 2022 18:42:17 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
Message-ID:

Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
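As a quick sanity check on a given node, something along the following lines shows what an installed build expects versus what the system provides (the library path is only a placeholder; adjust it to the actual GDR install location):

  ofed_info -s     # prints the installed MLNX OFED version (available when MLNX OFED is installed)
  nvidia-smi       # reports the driver version and the CUDA version the driver supports
  objdump -p /path/to/mvapich2-gdr/lib64/libmpi.so | grep NEEDED   # lists DT_NEEDED entries such as libcudart.so.11.0 and libcuda.so.1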
-----Original Message-----
From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss
Sent: Friday, January 28, 2022 1:20 PM
To: mvapich-discuss at lists.osu.edu
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR 11.2 with 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan
_______________________________________________
Mvapich-discuss mailing list
Mvapich-discuss at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich-discuss

From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022
From: subramoni.1 at osu.edu (Subramoni, Hari)
Date: Fri, 28 Jan 2022 19:20:02 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN>
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN>
Message-ID:

Hi, Dan. Please let us know which specific RPMs you would be interested in, and we can build them for you. We are working on building the ones you requested as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari.

-----Original Message-----
From: Pou, Dan
Sent: Friday, January 28, 2022 2:11 PM
To: Subramoni, Hari
Cc: mvapich-discuss at lists.osu.edu
Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs

Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan

On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote:
>Hi, Dan.
>
>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases.
>
>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher.
>
>RPMs for MOFED 5.x should work with MOFED 5.4.
>
>Best,
>Hari.
>
>-----Original Message-----
>From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss
>Sent: Friday, January 28, 2022 1:20 PM
>To: mvapich-discuss at lists.osu.edu
>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
>
>I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions.
>Are there any known issues using an MVAPICH2-GDR 11.2 with 11.4?
>Are there any known issues using an MVAPICH2-GDR OFED 5.x vs 5.4?
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss

From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code?
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
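On Ubuntu that conversion usually amounts to something like the following (the RPM name is the cuda11.3/mofed5.4 package referenced earlier in this thread; the exact name of the .deb that alien generates may differ slightly):

  sudo apt-get install alien
  sudo alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
  sudo dpkg -i mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1_2.3.6-2_amd64.deb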
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
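One common cause of "Cannot register vbuf region ... Cannot allocate memory (12)" is the locked-memory limit rather than log_num_mtt (log_num_mtt is an mlx4_core module parameter; mlx5 devices do not use it), and the "max locked memory (kbytes, -l) 65536" value in the ulimit output above is on the low side for RDMA registration. A rough sketch of raising it on every node, assuming root access, that pam_limits applies to ssh logins, and that the file name rdma.conf is an arbitrary choice:

    ulimit -l    # current limit in KB; 65536 is usually too small for registering vbuf pools

    # Make memlock unlimited for all users (requires root)
    cat <<'EOF' | sudo tee /etc/security/limits.d/rdma.conf
    *    soft    memlock    unlimited
    *    hard    memlock    unlimited
    EOF

    # New sessions pick up the limit; verify from a fresh login on each node, e.g.
    ssh cell4 'ulimit -l'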
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws):

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers,
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, December 1, 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari
Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed.

Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run 'make install', though, you should see all of the OMB binaries in a directory called 'libexec' in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory '/usr/libexec/osu-micro-benchmarks'; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running './configure --help' will list all of the configuration options available to you. You are correct that '--disable-fortran' is one way to reduce the installation size. Likewise you can use '--disable-cxx' to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set '--enable-g=none' (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
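A rough sketch of the workaround described above, i.e. a minimal build followed by deleting the installed OSU micro-benchmarks, assuming an illustrative prefix of /opt/mvapich2 instead of the default /usr; only the flags already mentioned in this thread are used:

    ./configure --prefix=/opt/mvapich2 --enable-g=none --disable-fortran --disable-cxx
    make -j"$(nproc)"
    make install

    # Remove the installed OMB binaries (around 3MB); this does not affect the MPI libraries themselves
    rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks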
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de> Hi, I am aware of the fact, that there is a possibility to use the aws-version, however we do just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of tcp for aws would be enough (at the moment). However the question (compare to 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) Why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note nemesis:ib is depricated): error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' ..... remains. Am I missing any flags, prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Mvapich-discuss Im Auftrag von mvapich-discuss-request at lists.osu.edu Gesendet: Freitag, 21. 
Januar 2022 15:45 An: mvapich-discuss at lists.osu.edu Betreff: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken.

------------------------------

Subject: Digest Footer

_______________________________________________
Mvapich-discuss mailing list
Mvapich-discuss at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich-discuss

------------------------------

End of Mvapich-discuss Digest, Vol 14, Issue 8
**********************************************

From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team,

I've been debugging an issue with our system that just came up after installing extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5, we start seeing problems as shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the 'all_cards.cfg' host file, I can use all hosts and the maximum number of ranks.

The CX-5 cards were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started appearing when the cables were added to the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing 'MV2_HOMOGENEOUS_CLUSTER=1' does not make any difference, and explicitly specifying 'MV2_IBA_HCA=mlx5_0' doesn't help either.

Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here.
Unfortunately, I had to disable all cards for running tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.
/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100
ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1
Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL:

From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open-source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there are only DT_NEEDED entries on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR built for OFED 5.x with OFED 5.4? Thanks, -Dan

From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID:
Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs built for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
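To see exactly which CUDA libraries a given MVAPICH2-GDR build declares as DT_NEEDED (and therefore which runtime it expects to find at load time), the dynamic section of the MPI library can be inspected directly. The library path below is only an example and will differ per installation:

    # list the DT_NEEDED entries of the GDR build's MPI library
    readelf -d /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep NEEDED
    # check how those entries resolve on the current system
    ldd /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep -E 'cuda|ibverbs'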
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open-source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there are only DT_NEEDED entries on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR built for OFED 5.x with OFED 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss

From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID:
Hi, Dan. Please let us know which specific RPMs you would be interested in; we can build them for you. We are working on building the ones you'd requested as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari.
-----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with the latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan
On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features, RPMs built for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open-source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there are only DT_NEEDED entries on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. >Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4? >Are there any known issues using an MVAPICH2-GDR built for OFED 5.x with OFED 5.4?
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
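On the Spack route mentioned earlier in this thread: a rough sketch of what that could look like is shown below. The package name mvapich2-gdr is an assumption here; check what your Spack instance actually provides before installing.

    # see which MVAPICH2-related packages this Spack instance knows about
    spack list mvapich2
    # inspect available versions and variants of the GDR package (name assumed)
    spack info mvapich2-gdr
    # install it (variant and compiler selection depend on your site setup)
    spack install mvapich2-gdr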
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
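Following up on the alien suggestion above: on an Ubuntu system the conversion would typically look something like the following. The RPM filename is the cuda11.3/mofed5.4 package referenced elsewhere in this thread; substitute whichever RPM matches your CUDA, MOFED, and compiler combination, and treat the exact commands as a sketch.

    # convert the el8 RPM to a .deb and install it (run as root or with sudo)
    apt-get install alien
    alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    dpkg -i mvapich2-gdr-*.deb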
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
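As an aside on the "Cannot register vbuf region" / "vbuf pool allocation failed: Cannot allocate memory (12)" errors reported above: with mlx5-based adapters there is normally no log_num_mtt module parameter to tune (that knob belongs to the older mlx4 driver), and this failure is most often caused by the locked-memory limit; the ulimit output above shows max locked memory capped at 65536 kB. A minimal sketch of the usual fix, assuming a standard pam_limits setup (exact file locations and whether sshd needs a restart can vary by distribution):

# /etc/security/limits.conf on both nodes, then log in again so the new limit applies
* soft memlock unlimited
* hard memlock unlimited

# verify from a fresh shell on each node
ulimit -l    # should report "unlimited"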
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work on AWS: [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw we get: CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites? Cheers Joachim -----Ursprüngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install", though, you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory "/usr/libexec/osu-micro-benchmarks"; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running "./configure --help" will list all of the configuration options available to you. You are correct that "--disable-fortran" is one way to reduce the installation size. Likewise you can use "--disable-cxx" to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set "--enable-g=none" (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
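For reference, a minimal sketch of the workaround described above, assuming the default /usr prefix and a stock mvapich2-2.3.6 source tree (the paths and the -j value are only examples; adjust them for your container build):

./configure --prefix=/usr --enable-g=none --disable-fortran --disable-cxx
make -j4 && make install
# remove the installed OSU micro-benchmarks (roughly 3MB) to keep the image small;
# this does not affect the MPI libraries themselves
rm -rf /usr/libexec/osu-micro-benchmarks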
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de> Hi, I am aware that there is a possibility to use the AWS version; however, we only provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of tcp for AWS would be enough (at the moment). However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note that nemesis:ib is deprecated): error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' ..... remains. Am I missing any flags or prerequisites? Cheers Joachim 
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com> Good day OSU team, I've been debugging an issue with our system that came up after installing new extra ConnectX-5 cards in some of the servers (12 out of 48) in our system. When adding a single rank from a server hosting an extra CX-5 we start seeing problems, as shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can use all hosts and the maximum number of ranks. The CX-5 cards were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started showing up when the cables were connected between the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing MV2_HOMOGENEOUS_CLUSTER=1 does not make any difference, and explicitly specifying MV2_IBA_HCA=mlx5_0 doesn't help either. Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. It is likely a configuration issue at my end, but I don't know what I'm missing here. 
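For context, a couple of generic checks that can help narrow down whether the extra, unconfigured CX-5 ports are being picked up by the MPI library (a hedged sketch only; the device name and the multi-rail settings below are examples based on the generic MVAPICH2 knobs, not something confirmed for this setup):

# confirm which HCAs and ports each node exposes, and whether the new CX-5 ports are link-down
ibv_devinfo | grep -E "hca_id|state"

# restrict the job to a single rail on the original adapter (MV2_IBA_HCA alone was already tried above)
export MV2_IBA_HCA=mlx5_0
export MV2_NUM_HCAS=1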
Unfortunately, I had to disable all cards for running test but if required I can reconfigure some of the server to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6, and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network [signature_849490256] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL: From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari. 
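As background to the compatibility question above, one quick way to see which CUDA runtime and verbs libraries a given MVAPICH2-GDR build was linked against is to inspect its DT_NEEDED entries (a generic sketch; the install path below is only an example):

# substitute the lib directory of your mvapich2-gdr installation
MV2_LIB=/opt/mvapich2/gdr/2.3.6/lib64/libmpi.so
readelf -d "$MV2_LIB" | grep NEEDED         # e.g. libcudart.so.11.0, libibverbs.so.1
ldd "$MV2_LIB" | grep -E "cuda|ibverbs"     # shows which installed libraries actually resolve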
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat
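A conversion along the lines Nat describes would look roughly like this on Ubuntu; the RPM file name below is just the one referenced later in this thread, and the exact name of the .deb that alien generates will differ slightly:

  # convert the MVAPICH2-GDR RPM to a .deb and install it (run as root or via sudo)
  apt-get install alien
  alien --scripts --to-deb mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
  dpkg -i mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1_*.deb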
From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04, kernel 5.4.0-92-generic. GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0. CUDA version is CUDA 11.4, CUDA driver: 470.82.01. Please let me know if there is any other information that you need. Thanks, John
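For reference, the version details above map to standard commands; a sketch like the following (assuming the MLNX_OFED tools, which provide ofed_info, are installed) is usually enough to pick or request a matching RPM:

  uname -r          # running kernel
  gcc --version     # compiler; per Nat's note, only the major version needs to match the RPM
  nvcc --version    # CUDA toolkit version, which should match the RPM exactly
  nvidia-smi        # CUDA driver version
  ofed_info -s      # MLNX_OFED release string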
From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat
From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John
From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat
From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you.
From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node, without errors. However, when we try to run across two nodes, we get the following error:

john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of the log_num_mtt parameter for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* files under /etc/modprobe.d, only mlx5_*. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output of ulimit -a on both nodes is:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 4126989
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4126989
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated. Thanks, John
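One detail that stands out in the ulimit output above is the 64 MB cap on locked memory (max locked memory 65536 kbytes). MVAPICH2 registers (pins) memory for its vbuf pools, so a low memlock limit is a plausible cause of the "Cannot register vbuf region" failure. A commonly used remedy, offered here only as an assumption about this setup rather than a confirmed fix, is to raise the limit on both nodes and log in again so the MPI processes inherit it:

  # /etc/security/limits.conf on both nodes (make sure sshd/PAM applies these limits)
  * soft memlock unlimited
  * hard memlock unlimited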
From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark In-Reply-To: <4f75d340469849ad91da3681e40b4bea at exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea at exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Hello Nat, we are building a portable container for different cloud providers. While everything works on our own system, testing in the AWS cloud shows a problem.
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work (AWS):

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a Docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, 1 December 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Installation without osu-benchmark

Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install", though, you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory "/usr/libexec/osu-micro-benchmarks"; in my experience it is typically around 3 MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running "./configure --help" will list all of the configuration options available to you. You are correct that "--disable-fortran" is one way to reduce the installation size. Likewise you can use "--disable-cxx" to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set "--enable-g=none" (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat

________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark

Hi MVAPICH-Team, we want to use containers for a cloud project. To minimize the container size & ability, we would like to disable, for some, the benchmark tests and compiling ability. Could you give us a recommendation on how to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understand, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner

-------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL:

From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea at exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK
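Pulling together the flags Nat suggested earlier in this thread (--enable-g=none, --disable-fortran, --disable-cxx) with the libexec cleanup, a minimal container build step might look roughly like the following sketch; the install prefix and the parallel job count are placeholders:

  # inside the (Debian-based) container image build
  ./configure --prefix=/usr --enable-g=none --disable-fortran --disable-cxx
  make -j 4 && make install
  # the OSU micro-benchmarks cannot yet be disabled at configure time, so remove them afterwards
  rm -rf /usr/libexec/osu-micro-benchmarks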
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65 at exch.dwd.de> Hi, I am aware that it is possible to use the AWS version; however, we just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and in this case the use of tcp for AWS would be enough (at the moment). However, the question (compare to 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note nemesis:ib is deprecated):

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

..... remains. Am I missing any flags or prerequisites? Cheers Joachim
Januar 2022 15:45 An: mvapich-discuss at lists.osu.edu Betreff: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken.

------------------------------

Subject: Digest Footer

_______________________________________________
Mvapich-discuss mailing list
Mvapich-discuss at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich-discuss

------------------------------

End of Mvapich-discuss Digest, Vol 14, Issue 8
**********************************************

From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team,

I've been debugging an issue with our system that came up after installing new extra ConnectX-5 adapters in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5 we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and I am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can use all hosts and the maximum number of ranks.

The CX-5s were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started exhibiting itself when the cables were connected to the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing "MV2_HOMOGENEOUS_CLUSTER=1" does not make any difference and explicitly specifying "MV2_IBA_HCA=mlx5_0" doesn't help either.

Note: The interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here.
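One way to narrow down a problem like this is to first confirm which RDMA devices each host actually exposes and then pin MVAPICH2 to a single HCA on every rank. A rough sketch using the standard InfiniBand diagnostics (ibv_devices from libibverbs, ibstat from infiniband-diags); the paths and host file follow the run reported below, and the device name may differ per node:

    # list the RDMA devices and check the state/ports of the one you expect to use
    ibv_devices
    ibstat mlx5_0
    # rerun the collective test pinned to that HCA on every rank
    /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np 47 -f all_cards.cfg \
        -env MV2_IBA_HCA=mlx5_0 \
        /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100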
Unfortunately, I had to disable all the cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.

/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100

ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1

Regards,

Nicolas Gagnon
Principal Designer/Architect, Engineering
ngagnon at rockportnetworks.com
Rockport | Simplify the Network

[signature_849490256]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 6093 bytes
Desc: image001.png
URL:

From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022
From: daniel.pou at hpe.com (Pou, Dan)
Date: Fri, 28 Jan 2022 18:20:26 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only a DT_NEEDED entry for libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions.
Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4?
Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4?

Thanks,
-Dan

From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022
From: subramoni.1 at osu.edu (Subramoni, Hari)
Date: Fri, 28 Jan 2022 18:42:17 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
Message-ID:

Hi, Dan.

We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases.

Due to changes in the way CUDA implements some underlying features, RPMs built for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher.

RPMs for MOFED 5.x should work with MOFED 5.4.

Best,
Hari.
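Along the lines of the DT_NEEDED observation above, the CUDA runtime a given MVAPICH2-GDR build was linked against can be checked directly on the installed library. A minimal sketch, assuming an /opt/mvapich2/gdr/... style prefix (adjust the library path to the actual RPM install location):

    # dependencies recorded at link time
    readelf -d /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep NEEDED
    # what the dynamic linker will actually resolve at run time
    ldd /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep -i -e cuda -e ibverbs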
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
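For context, the source-build route being described here is roughly the following; this is a sketch only (the CUDA path, install paths, and host names are placeholders), and as the replies below note, the MVAPICH2-GDR RPMs rather than a plain source build are the recommended way to get full GPU-direct support:

    # plain MVAPICH2 2.3.6 source build with basic CUDA-aware support
    ./configure --enable-cuda --with-cuda=/usr/local/cuda
    make -j$(nproc) && make install
    # run the device-to-device bidirectional bandwidth test with CUDA support enabled
    MV2_USE_CUDA=1 mpirun -np 2 -hosts cell3,cell4 \
        $HOME/mvapich2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw D D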
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
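The alien-based conversion mentioned earlier usually looks something like this; a hedged sketch only (the RPM file name is the one linked above, and the conversion needs root privileges):

    sudo apt-get install alien
    # convert the RPM to a .deb, keeping the pre/post install scripts, and install it
    sudo alien --scripts --install mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm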
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
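For example, a typical alien conversion is just the following (treat the file names as placeholders; the exact .rpm/.deb names depend on which GDR build is downloaded):

  # convert the RPM to a .deb, keeping the package scripts, then install it
  sudo alien --scripts --to-deb mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
  sudo dpkg -i mvapich2-gdr-*.deb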
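Separately, on the vbuf errors reported earlier in this thread ("Cannot register vbuf region" / "vbuf pool allocation failed: Cannot allocate memory"): one common cause is a low locked-memory limit, and the ulimit output above shows max locked memory capped at 65536 kB. If that turns out to be the culprit, raising the limit on both nodes and starting a fresh login/ssh session is the usual fix; note also that log_num_mtt is a parameter of the older mlx4 driver, so its absence under /etc/modprobe.d is expected on an mlx5-based MOFED 5.x install. A minimal sketch of the limits change:

  # /etc/security/limits.conf (on both nodes; re-login afterwards)
  *    soft    memlock    unlimited
  *    hard    memlock    unlimited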
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work (AWS): [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw we get: CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And with ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The attached document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites? Cheers Joachim
-----Original Message----- From: Shineman, Nat Sent: Wednesday, December 1, 2021 19:28 To: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install", though, you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory "/usr/libexec/osu-micro-benchmarks"; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running "./configure --help" will list all of the configuration options available to you. You are correct that "--disable-fortran" is one way to reduce the installation size. Likewise you can use "--disable-cxx" to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set "--enable-g=none" (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
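Put together, a minimal build along those lines might look like the sketch below (the prefix, job count and cleanup path are only examples; adjust them to your container layout):

  ./configure --prefix=/opt/mvapich2 --enable-g=none --disable-cxx --disable-fortran
  make -j 8 && make install
  # optional: drop the installed OSU micro-benchmarks to save space
  rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks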
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss
-------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL:
From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK
-----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem.
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de> Hi, I am aware of the fact that there is a possibility to use the aws-version; however, we only provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of tcp for aws would be enough (at the moment). However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note nemesis:ib is deprecated): error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' ..... remains. Am I missing any flags or prerequisites? Cheers Joachim
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com> Good day OSU team, I've been debugging an issue with our system that came up after installing new extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5 we start seeing the problem below: In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can use all hosts and the maximum number of ranks. The CX-5s were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started exhibiting when the cables were added to the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing "MV2_HOMOGENEOUS_CLUSTER=1" does not make any difference, and explicitly specifying "MV2_IBA_HCA=mlx5_0" doesn't help either. Note: The interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end but I don't know what I'm missing here.
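If it helps narrow this down, the HCAs and port states each node actually exposes can be listed with the standard verbs/IB diagnostics (assuming libibverbs-utils and infiniband-diags are installed); something like:

  ibv_devinfo -l    # list RDMA device names (mlx5_0, mlx5_1, ...)
  ibstat            # per-port state, link layer and rate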
Unfortunately, I had to disable all the cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the three drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL: From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR built for one MOFED 5.x release with MOFED 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs built for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
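The DT_NEEDED observation above is easy to reproduce on any given build. As a sketch, assuming the GDR library is installed as /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so (a hypothetical path; substitute the actual install location), the recorded dependencies and how they resolve on a node can be inspected with standard ELF tools:

# Sonames the library was linked against (e.g. libcudart.so.11.0, libcuda.so.1)
readelf -d /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep NEEDED
# Versioned symbols required from each dependency, if any
readelf -V /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so
# What those sonames resolve to on this particular node
ldd /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep -i -e cuda -e ibverbs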
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know of any specific RPMs you would be interested in; we can build them for you. We are working on building the ones you requested as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with the latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4?
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
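For reference, the alien-based conversion mentioned above could look roughly like the following sketch. It assumes one of the el8 GDR RPMs from the download page has been copied to the current directory (the RPM name shown is the cuda11.3/mofed5.4/gnu8.4.1 build discussed later in this thread; the converted .deb name that alien produces will differ slightly):

# Convert the RPM to a .deb, preserving the package scripts, then install it
sudo apt-get install alien
sudo alien --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
sudo dpkg -i ./mvapich2-gdr-*.deb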
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
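A minimal sketch of the alien conversion Nat refers to, assuming the RPM linked earlier in this thread; the exact .deb filename alien generates may differ, so a glob is used here:

    sudo apt-get install alien
    # convert the MVAPICH2-GDR RPM to a .deb, keeping the package's pre/post install scripts
    sudo alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    sudo dpkg -i mvapich2-gdr-*.deb

Per Nat's note that CUDA must match exactly while the compiler only needs the same major version, this particular cuda11.3/gnu8.4.1 RPM would still expect CUDA 11.3 and a GCC 8.x runtime; on the CUDA 11.4 / GCC 9.3 system described above, the custom RPM requested earlier is the safer route.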
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
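Returning to the "Cannot register vbuf region" / "vbuf pool allocation failed: Cannot allocate memory (12)" failure reported at the top of this message: the ulimit output posted there shows max locked memory capped at 65536 kB, and a low memlock limit is a common cause of InfiniBand memory-registration failures. The log_num_mtt parameter mentioned there applies to the older mlx4 driver; mlx5 devices, as on this system, do not expose it, so the memlock limit is the more likely knob. A sketch of the usual check and fix, offered as a general RDMA tuning step rather than a diagnosis confirmed in this thread, assuming root access on both nodes:

    ulimit -l                                    # currently reports 65536 on both nodes
    # raise the limit for all users, then open a fresh login session on each node so it takes effect
    echo '* soft memlock unlimited' | sudo tee -a /etc/security/limits.conf
    echo '* hard memlock unlimited' | sudo tee -a /etc/security/limits.conf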
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined.

With ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a Docker container with Debian bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Ursprüngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark

Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run 'make install', though, you should see all of the OMB binaries in a directory called 'libexec' in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory '/usr/libexec/osu-micro-benchmarks'; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running './configure --help' will list all of the configuration options available to you. You are correct that '--disable-fortran' is one way to reduce the installation size. Likewise you can use '--disable-cxx' to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set '--enable-g=none' (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
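Pulling Nat's suggestions together, a minimal-footprint container build could look like the sketch below; the install prefix is illustrative, the flags are the ones named above, and the final step is the OMB cleanup workaround Nat describes:

    ./configure --prefix=/opt/mvapich2 --enable-g=none --disable-fortran --disable-cxx
    make -j "$(nproc)"
    make install
    # workaround until a configure-time switch exists: drop the installed OSU micro-benchmarks (~3 MB)
    rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks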
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gefährlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zunächst auf gefährliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gewährleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken.

From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi, I am aware that there is the possibility of using the AWS version; however, we only provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and for now using TCP on AWS would be enough. However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) remains: why does ./configure --with-device=ch3:nemesis:ib,tcp not compile (note: nemesis:ib is deprecated)?

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' .....

Am I missing any flags or prerequisites? Cheers Joachim

-----Ursprüngliche Nachricht----- Von: Mvapich-discuss Im Auftrag von mvapich-discuss-request at lists.osu.edu Gesendet: Freitag, 21.
Januar 2022 15:45 An: mvapich-discuss at lists.osu.edu Betreff: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. ------------------------------ Subject: Digest Footer _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ------------------------------ End of Mvapich-discuss Digest, Vol 14, Issue 8 ********************************************** ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gefährlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zunächst auf gefährliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gewährleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: mvapich-discuss-bounces at lists.osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken.

From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team, I've been debugging an issue with our system that came up after installing an extra ConnectX-5 in some of the servers (12 out of 48) in our system. As soon as a single rank comes from a server hosting an extra CX-5, we start seeing the problem shown below. In this case, I disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the 'all_cards.cfg' host file, I can use all hosts and the maximum number of ranks. The CX-5s were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I have tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get a different error (which is strange). The problem started showing up when the cables were added between the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing MV2_HOMOGENEOUS_CLUSTER=1 does not make any difference, and explicitly specifying MV2_IBA_HCA=mlx5_0 does not help either. Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. It is likely a configuration issue at my end, but I don't know what I'm missing here.
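A short diagnostic sketch for the multi-HCA situation described above; the second device name is hypothetical, and this only illustrates the kind of per-node check that narrows things down rather than a confirmed fix:

    ibv_devinfo -l      # list RDMA devices on each node; the uncabled CX-5 should appear here too
    ibstat mlx5_0       # original adapter
    ibstat mlx5_1       # hypothetical name of the added CX-5; an unconfigured port typically reports State: Down

With MV2_IBA_HCA=mlx5_0 already set, MVAPICH2 should ignore the extra card, so the "ssh: connect to host 172.20.141.148 port 22: No route to host" line in the run below is also worth ruling out separately: it suggests one hostfile entry resolves to an unreachable address, which is a launcher-level problem rather than an MPI transport failure.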
Unfortunately, I had to disable all cards for running tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.

/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100
ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1

Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network

-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL:

From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4? Thanks, -Dan

From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID:

Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
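To see concretely what Dan is describing, the CUDA dependencies of an installed MVAPICH2-GDR library can be inspected directly; the path construction below is illustrative and will differ depending on where the RPM installed its libraries:

    MV2_LIB=$(dirname "$(which mpirun)")/../lib64/libmpi.so   # adjust to the actual GDR install location
    readelf -d "$MV2_LIB" | grep NEEDED     # shows DT_NEEDED entries such as libcudart.so.11.0 and libcuda.so.1
    ldd "$MV2_LIB" | grep -i cuda           # shows which CUDA runtime those entries resolve to on this system

Per Hari's note, matching DT_NEEDED names is not sufficient on its own: a build for CUDA 11.2 or older is not expected to work against CUDA 11.3 and higher, while MOFED 5.x builds should interoperate with MOFED 5.4.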
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 14:26:06 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?

Hi John,

You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications.

Thanks,
Nat
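For reference, a rough sketch of the alien-based conversion described above, assuming the cuda11.3/mofed5.4 RPM named later in this thread and the stock Ubuntu alien package (exact file and package names may differ on your system):

    # install the RPM-to-deb converter
    sudo apt-get install alien
    # convert the MVAPICH2-GDR RPM, keeping its install scripts
    sudo alien --scripts --to-deb mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    # install whatever .deb alien produced (the generated name may vary)
    sudo dpkg -i mvapich2-gdr-*.deb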
From john at flexcompute.com Wed Jan 12 11:14:15 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:14:15 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?

Hi Nat,

We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM:
http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm

If you could build us a custom RPM for our system, that would be very helpful. We're running:

Ubuntu 20.04, kernel 5.4.0-92-generic
GCC version: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CUDA version: CUDA 11.4
CUDA driver: 470.82.01

Please let me know if there is any other information that you need.

Thanks,
John
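One way to collect the same system details (a sketch assuming standard Ubuntu, CUDA, and NVIDIA driver tooling):

    lsb_release -d                  # Ubuntu release
    uname -r                        # kernel version
    gcc --version | head -1         # compiler version
    nvcc --version | grep release   # CUDA toolkit version, if nvcc is on PATH
    nvidia-smi --query-gpu=driver_version --format=csv,noheader   # CUDA driver version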
From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:16:45 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?

Hi John,

Can you tell us the ofed version on your system?

Thanks,
Nat
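One way to report the installed Mellanox OFED version (a sketch assuming a standard MLNX_OFED install, which provides the ofed_info utility):

    # short MOFED version string
    ofed_info -s
    # HCAs and firmware seen by the verbs stack
    ibv_devinfo | grep -E 'hca_id|fw_ver'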
From john at flexcompute.com Wed Jan 12 11:19:08 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:19:08 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?

Hi Nat,

We are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64

Thanks,
John
From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?

John,

Thanks, we will get started on generating this RPM shortly.

Nat
From john at flexcompute.com Wed Jan 12 11:21:05 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:21:05 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?

Great, thank you.
From john at flexcompute.com Wed Jan 12 15:46:07 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 15:46:07 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?

Hello,

While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node, without errors. However, when we try to run across two nodes, we get the following error:

john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* files under /etc/modprobe.d, only mlx5_*. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output of ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
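A likely contributor here, though not confirmed in this thread, is the locked-memory limit: vbuf allocation registers (locks) memory, and the ulimit output above caps locked memory at 65536 kB. Note also that log_num_mtt is a parameter of the mlx4_core module, so its absence on an mlx5-only MLNX_OFED 5.x install is expected. A sketch of one way to raise the limit, assuming the nodes use /etc/security/limits.conf and jobs start from login shells (daemons such as sshd may need their own limits raised separately):

    # current locked-memory limit, in kB
    ulimit -l
    # allow unlimited locked memory for all users (takes effect on next login)
    echo '* soft memlock unlimited' | sudo tee -a /etc/security/limits.conf
    echo '* hard memlock unlimited' | sudo tee -a /etc/security/limits.conf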
From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Fri, 21 Jan 2022 14:27:51 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark

Hello Nat,

We are building a portable container for different cloud providers. While everything works on our own system, testing on the AWS cloud shows a problem.
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

With ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The attached document contains the script to install mvapich2 in a docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, 1 December 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Installation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed.

Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run 'make install' you should see all of the OMB binaries in a directory called 'libexec' in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory '/usr/libexec/osu-micro-benchmarks'; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running './configure --help' will list all of the configuration options available to you. You are correct that '--disable-fortran' is one way to reduce the installation size. Likewise you can use '--disable-cxx' to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all", as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set '--enable-g=none' (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done.

Thanks,
Nat

________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark

Hi MVAPICH-Team,

We want to use containers for a cloud project. To minimize the container size and abilities, we would like to disable, for some containers, the benchmark tests and compiling ability. Could you give us a recommendation on how to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud?

Sincerely
Joachim Tscheuschner

-------------- next part --------------
A non-text attachment was scrubbed: install.mvapich.sh (application/octet-stream, 2324 bytes)

From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 21 Jan 2022 14:44:53 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named 'MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks,
DK
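A quick way to see which fabric a given cloud instance actually exposes (a sketch assuming the rdma-core and libfabric command-line utilities are installed; the EFA provider only appears where the AWS EFA driver and its libfabric support are present):

    # list verbs devices (InfiniBand / RoCE HCAs appear here)
    ibv_devices
    # ask libfabric whether an EFA provider is available
    fi_info -p efa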
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark - ch3:nemesis:ib, tcp

Hi,

I am aware of the fact that there is a possibility to use the AWS version; however, we just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of tcp for AWS would be enough (at the moment). However, the question (compare to 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) why

./configure --with-device=ch3:nemesis:ib,tcp

does not compile (note nemesis:ib is deprecated):

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'
.....

remains. Am I missing any flags or prerequisites?

Cheers
Joachim
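For the TCP-only fallback mentioned above, a minimal sketch that also folds in the size-reduction suggestions from the December reply; the /usr prefix and the post-install removal of the benchmark directory are illustrative assumptions, and this does not address the ch3:nemesis:ib,tcp build error itself:

    # TCP-only Nemesis build without Fortran/C++ bindings or debug symbols
    ./configure --prefix=/usr \
                --with-device=ch3:nemesis:tcp \
                --enable-g=none --disable-fortran --disable-cxx
    make -j"$(nproc)" && make install
    # workaround from the earlier reply: drop the installed OSU micro-benchmarks
    rm -rf /usr/libexec/osu-micro-benchmarks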
-----Original Message-----
From: Mvapich-discuss On Behalf Of mvapich-discuss-request at lists.osu.edu
Sent: Friday, 21 January 2022 15:45
To: mvapich-discuss at lists.osu.edu
Subject: Mvapich-discuss Digest, Vol 14, Issue 8

Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu

To subscribe or unsubscribe via the World Wide Web, visit https://lists.osu.edu/mailman/listinfo/mvapich-discuss or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu

You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu

When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..."

Today's Topics:

   1. Re: Compiling / Installation without osu-benchmark (Tscheuschner Joachim)
   2. Re: Compiling / Installation without osu-benchmark (Panda, Dhabaleswar)

----------------------------------------------------------------------

Message: 1
Date: Fri, 21 Jan 2022 14:27:51 +0000
From: Tscheuschner Joachim
To: "'Shineman, Nat'" , Tscheuschner Joachim
Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu"
Subject: Re: [Mvapich-discuss] Compiling / Installation without osu-benchmark
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de>
Content-Type: text/plain; charset="utf-8"

Hello Nat,

we are building a portable container for different cloud-providers. While everything works on our system, testing with the AWS cloud shows a problem.

If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (AWS):

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
Desc: install.mvapich.sh
URL: 

------------------------------

Message: 2
Date: Fri, 21 Jan 2022 14:44:53 +0000
From: "Panda, Dhabaleswar"
To: Tscheuschner Joachim , "Shineman, Nat"
Cc: "mvapich-discuss at lists.osu.edu"
Subject: Re: [Mvapich-discuss] Compiling / Installation without osu-benchmark
Message-ID: 
Content-Type: text/plain; charset="utf-8"

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks,
DK
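On AWS specifically, a quick way to confirm that an instance actually exposes an EFA device to libfabric; a sketch, not part of this thread, and it assumes the libfabric utilities shipped with the EFA driver are installed:

    # Lists libfabric endpoints offered by the EFA provider; no output means
    # the instance type or driver setup does not provide EFA
    fi_info -p efa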
------------------------------

Subject: Digest Footer

_______________________________________________
Mvapich-discuss mailing list
Mvapich-discuss at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich-discuss

------------------------------

End of Mvapich-discuss Digest, Vol 14, Issue 8
**********************************************

From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team,

I've been debugging an issue with our system that just came up after installing new extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5 we start seeing problems as below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can use all hosts and the maximum number of ranks.

The CX-5s were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started showing up when the cables were added between the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing MV2_HOMOGENEOUS_CLUSTER=1 does not make any difference, and explicitly specifying MV2_IBA_HCA=mlx5_0 doesn't help either.

Note: The interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end but I don't know what I'm missing here.
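A short sketch of how one might enumerate the HCAs on an affected node and pin MVAPICH2 to a single adapter; MV2_IBA_HCA is taken from the mpiexec command below, while MV2_NUM_HCAS is an assumption drawn from the general MVAPICH2 runtime parameters rather than something suggested in this thread:

    # On a node that hosts the extra ConnectX-5, list all verbs devices
    ibv_devices
    # Then restrict MVAPICH2 to the first adapter only for the run
    export MV2_IBA_HCA=mlx5_0
    export MV2_NUM_HCAS=1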
Unfortunately, I had to disable all cards for running tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.

/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100

ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1

Regards,

Nicolas Gagnon
Principal Designer/Architect, Engineering
ngagnon at rockportnetworks.com
Rockport | Simplify the Network

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 6093 bytes
Desc: image001.png
URL: 

From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022
From: daniel.pou at hpe.com (Pou, Dan)
Date: Fri, 28 Jan 2022 18:20:26 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions.
Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4?
Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4?

Thanks,
-Dan

From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022
From: subramoni.1 at osu.edu (Subramoni, Hari)
Date: Fri, 28 Jan 2022 18:42:17 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
Message-ID: 

Hi, Dan.

We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases.

Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher.

RPMs for MOFED 5.x should work with MOFED 5.4.

Best,
Hari.
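A quick sketch of how one might check a node against the compatibility constraints above before picking a GDR RPM; these are standard CUDA, MLNX OFED, and GCC utilities, not commands prescribed by the thread:

    # CUDA toolkit and driver versions installed on the node
    nvcc --version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    # MLNX OFED release, e.g. MLNX_OFED_LINUX-5.4-...
    ofed_info -s
    # Compiler version; only the major version needs to match the RPM
    gcc --version | head -n1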
-----Original Message-----
From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss
Sent: Friday, January 28, 2022 1:20 PM
To: mvapich-discuss at lists.osu.edu
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions.
Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4?
Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4?

Thanks,
-Dan
_______________________________________________
Mvapich-discuss mailing list
Mvapich-discuss at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich-discuss

From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022
From: subramoni.1 at osu.edu (Subramoni, Hari)
Date: Fri, 28 Jan 2022 19:20:02 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN>
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN>
Message-ID: 

Hi, Dan.

Please let us know which specific RPMs you would be interested in; we can build them for you. We are working on building the ones you requested as we speak.

Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page.

Thx,
Hari.

-----Original Message-----
From: Pou, Dan
Sent: Friday, January 28, 2022 2:11 PM
To: Subramoni, Hari
Cc: mvapich-discuss at lists.osu.edu
Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs

Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule.
It would be greatly appreciated if GDR could keep up with the latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support.

Thanks,
-Dan


On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote:
>Hi, Dan.
>
>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases.
>
>Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher.
>
>RPMs for MOFED 5.x should work with MOFED 5.4.
>
>Best,
>Hari.
>
>-----Original Message-----
>From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss
>Sent: Friday, January 28, 2022 1:20 PM
>To: mvapich-discuss at lists.osu.edu
>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
>
>I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions.
>Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4?
>Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4?
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> Thanks,
> John
From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

John,

Thanks, we will get started on generating this RPM shortly.

Nat
From john at flexcompute.com Wed Jan 12 11:21:05 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:21:05 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

Great, thank you.
From john at flexcompute.com Wed Jan 12 15:46:07 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 15:46:07 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

Hello,

While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node, without errors. However, when we try to run across two nodes, we get the following error:

john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output for ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
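One commonly suggested check for "Cannot register vbuf region" / "Cannot allocate memory" failures like the ones above is the locked-memory limit, which is capped at 65536 KB in the ulimit output shown; a sketch of raising it, assuming root access on both nodes and a fresh login session afterwards:

    # /etc/security/limits.conf (or a file under /etc/security/limits.d/) on both nodes
    *   soft   memlock   unlimited
    *   hard   memlock   unlimited

    # verify from a new login shell on each node
    ulimit -l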
From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Fri, 21 Jan 2022 14:27:51 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark
In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de>
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>

Hello Nat,

we are building a portable container for different cloud providers. While everything works on our system, testing on the AWS cloud shows a problem.
If we compile mvapich2-2.3.6 with

./configure --enable-g=none

the communication does not work (AWS):

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

CPU Affinity is undefined:

[cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, 1 December 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Installation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed.

Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install" you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory "/usr/libexec/osu-micro-benchmarks"; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running "./configure --help" will list all of the configuration options available to you. You are correct that "--disable-fortran" is one way to reduce the installation size. Likewise you can use "--disable-cxx" to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set "--enable-g=none" (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done.

Thanks,
Nat

________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark

Hi MVAPICH-Team,

we want to use containers for a cloud-project. To minimize the container size & ability, we would like to disable, for some, the benchmark tests and compiling ability. Could you give us a recommendation to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud?

Sincerely
Joachim Tscheuschner

-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
Desc: install.mvapich.sh
URL:

From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 21 Jan 2022 14:44:53 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark
In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
Message-ID:

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks,
DK
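A minimal sketch of the size-reduction workflow from the earlier reply in this thread (configure-time flags plus deleting the installed OMB directory); the install prefix is illustrative:

    ./configure --prefix=/opt/mvapich2 --enable-g=none --disable-fortran --disable-cxx
    make -j"$(nproc)" && make install
    # The OMB binaries land under libexec; removing them does not affect the MPI library itself
    rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks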
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi,

I am aware of the fact that there is a possibility to use the AWS version; however, we just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of tcp for AWS would be enough (at the moment). However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why

./configure --with-device=ch3:nemesis:ib,tcp

does not compile (note: nemesis:ib is deprecated):

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'
.....

remains. Am I missing any flags or prerequisites?

Cheers
Joachim
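For reference, a sketch of the two configurations mentioned in this thread that are reported to build cleanly, installed side by side so one container image can carry both (prefixes are illustrative); the unified ch3:nemesis:ib,tcp build remains the open question above:

    # Default OFA-IB-CH3 build, for clouds/clusters with InfiniBand
    # (build each variant in a separate build directory, or run `make distclean` in between)
    ./configure --prefix=/opt/mvapich2-ofa --enable-g=none && make -j"$(nproc)" && make install

    # TCP/IP-only Nemesis build, for instances without a supported HCA
    ./configure --prefix=/opt/mvapich2-tcp --with-device=ch3:nemesis:tcp --enable-g=none && make -j"$(nproc)" && make install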
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fails to complete when servers have an extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team,

I've been debugging an issue with our system that came up after installing new extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5, we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can use all hosts and the maximum number of ranks.

The CX-5s were added a month ago, and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and am still getting into this state. Depending on the version used, I'm getting different errors (which is strange). The problem started showing up when the cables were added to the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing "MV2_HOMOGENEOUS_CLUSTER=1" does not make any difference, and explicitly specifying "MV2_IBA_HCA=mlx5_0" doesn't help either.

Note: The interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here.
Unfortunately, I had to disable all the extra cards in order to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL: From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open-source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only a DT_NEEDED entry for libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
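As a concrete illustration of the DT_NEEDED point Dan raises above, the CUDA libraries a given MVAPICH2-GDR build links against can be inspected with readelf or ldd; the libmpi.so path below is only a placeholder for wherever the RPM installs it on a particular system:

    # list the shared-library dependencies recorded in the binary (path is a placeholder)
    readelf -d /opt/mvapich2-gdr/lib64/libmpi.so | grep NEEDED
    # or resolve them against the libraries actually installed on the node
    ldd /opt/mvapich2-gdr/lib64/libmpi.so | grep -i -e cuda -e ibverbs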
From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know which specific RPMs you would be interested in; we can build them for you. We are working on building the ones you requested as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari.
From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely we will get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with the latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan

From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries; I think we prefer GCC 8.1. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I were able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan
From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat
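A minimal sketch of the alien conversion mentioned above, using the RPM filename from the download page as an example; the exact alien invocation and the name of the generated .deb may differ on a given Ubuntu release:

    sudo apt-get install alien
    # convert the RPM to a .deb, keeping the package scripts
    sudo alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    # install the generated package (the generated filename may differ)
    sudo dpkg -i mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1_*_amd64.deb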
From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04, kernel 5.4.0-92-generic. GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0. CUDA version is CUDA 11.4; CUDA driver: 470.82.01. Please let me know if there is any other information that you need. Thanks, John
From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat
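The version details Nat asks for here can usually be gathered directly on the node; ofed_info ships with MLNX_OFED and the other commands are standard, so treat this only as a convenience sketch:

    ofed_info -s        # prints the installed MLNX_OFED release string
    gcc --version       # compiler major/minor version
    nvcc --version      # CUDA toolkit version (the driver version is reported by nvidia-smi)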
From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John
From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat

From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you.

From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node, without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
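(For example, a conversion along these lines is usually sufficient on Ubuntu 20.04; the exact name of the .deb that alien writes out can differ slightly from what is shown, so install whatever file it actually produces.)

    sudo apt-get install alien
    # --scripts carries the RPM's pre/post-install scripts over to the .deb
    sudo alien --scripts --to-deb mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    sudo dpkg -i mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1_2.3.6-2_amd64.deb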
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

CPU Affinity is undefined
[cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT as a portable container I will lose IB for the other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, December 1, 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari
Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run 'make install', though, you should see all of the OMB binaries in a directory called 'libexec' in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory '/usr/libexec/osu-micro-benchmarks'; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running './configure --help' will list all of the configuration options available to you. You are correct that '--disable-fortran' is one way to reduce the installation size. Likewise you can use '--disable-cxx' to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set '--enable-g=none' (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
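(Putting those suggestions together, a stripped-down build would look roughly like the following; the install prefix is illustrative and the flags are the ones discussed above.)

    ./configure --prefix=/opt/mvapich2 --enable-g=none --disable-fortran --disable-cxx
    make -j"$(nproc)"
    make install
    # no configure-time switch for the OSU benchmarks yet, so remove them after install
    rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks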
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de> Hi, I am aware of the fact, that there is a possibility to use the aws-version, however we do just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of tcp for aws would be enough (at the moment). However the question (compare to 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) Why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note nemesis:ib is depricated): error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' ..... remains. Am I missing any flags, prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Mvapich-discuss Im Auftrag von mvapich-discuss-request at lists.osu.edu Gesendet: Freitag, 21. 
Januar 2022 15:45 An: mvapich-discuss at lists.osu.edu Betreff: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. ------------------------------ Subject: Digest Footer _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ------------------------------ End of Mvapich-discuss Digest, Vol 14, Issue 8 **********************************************

From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team, I've been debugging an issue with our system that just came up after installing new extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5 we start seeing problems as below: In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the 'all_cards.cfg' host file, I can again use all hosts and the maximum number of ranks. The CX-5s were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get a different error (which is strange). The problem started showing up when the cables were added to the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing 'MV2_HOMOGENEOUS_CLUSTER=1' does not make any difference, and explicitly specifying 'MV2_IBA_HCA=mlx5_0' doesn't help either. Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here.
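(As background for this kind of multi-HCA report: the adapters and port states that MVAPICH2 will see on each node can be listed with the standard RDMA tools, for instance as below. ibv_devinfo ships with rdma-core and ibstat with infiniband-diags; uncabled CX-5 ports would normally show up as down/polling.)

    # summarize every HCA and the state of each of its ports
    ibv_devinfo | grep -E 'hca_id|port:|state'
    ibstat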
Unfortunately, I had to disable all cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.

/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100

ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1

Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network

-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL:

From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR built for OFED 5.x with OFED 5.4? Thanks, -Dan

From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID:

Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs built for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
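(As a quick sanity check on version matching, the CUDA runtime a particular GDR build was linked against, and the dependencies an RPM declares, can be inspected with standard tools; the library path below is illustrative.)

    # shared-library view of an installed GDR build (shows the libcudart/libcuda DT_NEEDED entries)
    readelf -d /path/to/mvapich2-gdr/lib64/libmpi.so | grep NEEDED
    # dependency view of a downloaded, not-yet-installed RPM
    rpm -qpR mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm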
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
> 
> Thanks,
> DK

From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 14:26:06 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

Hi John,

You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to match the MOFED major version, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications.

Thanks,
Nat
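As a rough sketch of the alien conversion Nat describes, the steps on Ubuntu might look like the following; the .rpm file name is just the one linked later in this thread, so substitute the package that matches your CUDA, MOFED, and compiler versions:

    # convert the RPM to a .deb, keeping the package's install scripts
    sudo alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    # install the generated package
    sudo dpkg -i mvapich2-gdr*.deb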
From john at flexcompute.com Wed Jan 12 11:14:15 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:14:15 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

Hi Nat,

We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM:

http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm

If you could build us a custom RPM for our system, that would be very helpful. We're running:

Ubuntu 20.04, kernel 5.4.0-92-generic
GCC version: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CUDA version: CUDA 11.4
CUDA driver: 470.82.01

Please let me know if there is any other information that you need.

Thanks,
John
From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:16:45 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

Hi John,

Can you tell us the OFED version on your system?

Thanks,
Nat
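For reference, the version details requested in this thread can usually be collected with commands along these lines, assuming a standard CUDA toolkit and MLNX_OFED installation:

    lsb_release -d     # distribution, e.g. Ubuntu 20.04
    uname -r           # kernel version
    gcc --version      # compiler version
    nvcc --version     # CUDA toolkit version
    nvidia-smi         # CUDA driver version
    ofed_info -s       # MLNX_OFED version string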
From john at flexcompute.com Wed Jan 12 11:19:08 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:19:08 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

Hi Nat,

We are using MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64.

Thanks,
John
From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

John,

Thanks, we will get started on generating this RPM shortly.

Nat
From john at flexcompute.com Wed Jan 12 11:21:05 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:21:05 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

Great, thank you.
From john at flexcompute.com Wed Jan 12 15:46:07 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 15:46:07 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

Hello,

While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node, without errors. However, when we try to run across two nodes, we get the following error:

john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* files under /etc/modprobe.d, only mlx5_*. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output of ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
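The "Cannot register vbuf region" / "Cannot allocate memory" failures above are often tied to the locked-memory limit rather than to log_num_mtt, which is an mlx4_core module parameter and so is consistent with only mlx5_* files being present here; the ulimit output shows max locked memory capped at 65536 kB. A possible check and fix, assuming the usual /etc/security/limits.conf mechanism, is sketched below:

    # on both nodes: check the current limit (65536 kB in the output above)
    ulimit -l

    # raise it by adding these lines to /etc/security/limits.conf,
    # then start a fresh login session on each node:
    #   * soft memlock unlimited
    #   * hard memlock unlimited

    # re-run the two-node test once 'ulimit -l' reports unlimited
    MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw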
From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Fri, 21 Jan 2022 14:27:51 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To: 
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de>
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>

Hello Nat,

We are building a portable container for different cloud providers. While everything works on our own system, testing in the AWS cloud shows a problem.
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

With ./configure --enable-g=all and

    singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

With ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, but as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a Docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim
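As a rough sketch (the install prefix is only an example), a TCP-only build that combines the workaround above with the size-trimming flags from Nat's reply quoted below would be configured along these lines:

    ./configure --prefix=/opt/mvapich2 \
                --with-device=ch3:nemesis:tcp \
                --enable-g=none --disable-fortran --disable-cxx
    make -j"$(nproc)" && make install
    # the OSU micro-benchmarks land under libexec and can simply be removed
    rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks

As noted above, this build only provides TCP/IP communication, so it is a stopgap for the AWS case rather than a replacement for the IB-capable builds used elsewhere.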
-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, December 1, 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed.

Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install", though, you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory "/usr/libexec/osu-micro-benchmarks"; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running "./configure --help" will list all of the configuration options available to you. You are correct that "--disable-fortran" is one way to reduce the installation size. Likewise you can use "--disable-cxx" to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all", as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set "--enable-g=none" (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done.

Thanks,
Nat

________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark

Hi MVAPICH Team,

We want to use containers for a cloud project. To minimize the container size and ability, we would like to disable, for some of them, the benchmark test and the compiling ability. Could you give us a recommendation on how to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud?

Sincerely
Joachim Tscheuschner

-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
Desc: install.mvapich.sh
URL: 

From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 21 Jan 2022 14:44:53 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
Message-ID: 

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named MVAPICH2-X-AWS. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks,
DK
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi,

I am aware of the fact that there is a possibility to use the AWS version; however, we only provide the container and the customers choose the cloud. To minimize the effort, we want to use the normal version. In this case the use of TCP for AWS would be enough (at the moment). However, the question (compare to 4.13 "Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)") of why

    ./configure --with-device=ch3:nemesis:ib,tcp

does not compile (note nemesis:ib is deprecated):

    error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

remains. Am I missing any flags or prerequisites?

Cheers
Joachim
-----Original Message----- From: Mvapich-discuss On behalf of mvapich-discuss-request at lists.osu.edu Sent: Friday, 21 January 2022 15:45 To: mvapich-discuss at lists.osu.edu Subject: Mvapich-discuss Digest, Vol 14, Issue 8
Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://lists.osu.edu/mailman/listinfo/mvapich-discuss or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..."
Today's Topics: 1. Re: Compiling / Installation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Installation without osu-benchmark (Panda, Dhabaleswar)
---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Installation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8"
Hello Nat, we are building a portable container for different cloud providers. While everything works on our system, testing with the AWS cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (AWS): [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw the output is: CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The attached document contains the script to install mvapich2 in a Docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites? Cheers Joachim
-----Original Message----- From: Shineman, Nat Sent: Wednesday, 1 December 2021 19:28 To: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: Re: Compiling / Installation without osu-benchmark
Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run 'make install', though, you should see all of the OMB binaries in a directory called 'libexec' in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory '/usr/libexec/osu-micro-benchmarks'; in my experience it is typically around 3 MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running './configure --help' will list all of the configuration options available to you. You are correct that '--disable-fortran' is one way to reduce the installation size. Likewise you can use '--disable-cxx' to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all", as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set '--enable-g=none' (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat
________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark
Hi MVAPICH team, we want to use containers for a cloud project. To minimize the container size and its capabilities, we would like to disable, for some containers, the benchmark tests and the compiling ability. Could you give us a recommendation on how to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss
-------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL:
------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Installation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8"
Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK
------------------------------ Subject: Digest Footer _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ------------------------------ End of Mvapich-discuss Digest, Vol 14, Issue 8 **********************************************
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fails to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>
Good day OSU team, I've been debugging an issue with our system that came up after installing new extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5 we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the 'all_cards.cfg' host file, I can now use all hosts and the maximum number of ranks. The CX-5s were added a month ago and I initially suspected I made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started to show up when the cables were added to the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing the MV2_HOMOGENEOUS_CLUSTER=1 setting does not make any difference, and explicitly specifying MV2_IBA_HCA=mlx5_0 doesn't help either. Note: the interfaces were not configured at this stage, and I have not used the card at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here.
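Purely as an illustration of narrowing the problem down to one adapter (the second device name below is an assumption based on the mlx5_0 already mentioned in this report; this is not a fix confirmed by the list), one can list the HCAs the IB stack sees on each host and compare the servers with and without the extra CX-5:

ibv_devinfo -l        # lists the RDMA devices visible on this host (e.g. mlx5_0, mlx5_1)
ibstat mlx5_1         # shows port state and link layer of the newly added card
# MV2_IBA_HCA=mlx5_0, as used in the command below, restricts MVAPICH2 to the original adapter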
Unfortunately, I had to disable all cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL:
From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open-source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR built for OFED 5.x with OFED 5.4? Thanks, -Dan
From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID:
Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
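Before pairing an MVAPICH2-GDR build with a newer CUDA toolkit, one quick sanity check consistent with the DT_NEEDED observation above is to inspect which CUDA runtime the installed MPI library actually requires; the library path below is a placeholder for illustration, not the real RPM install prefix:

objdump -p /path/to/mvapich2-gdr/lib64/libmpi.so | grep NEEDED     # prints DT_NEEDED entries such as libcudart.so.11.0 and libcuda.so.1
ldd /path/to/mvapich2-gdr/lib64/libmpi.so | grep -i cuda           # shows which libcudart/libcuda the loader resolves at run time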
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss
From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID:
Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code?
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation saying this may be due to the value of the log_num_mtt parameter in OFED, and describing how to change it by editing /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* files under /etc/modprobe.d, only mlx5_*. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output of ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
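P.S. One guess on our side, not a confirmed diagnosis: the 64 MB "max locked memory" limit above looks low for RDMA buffer registration, and as far as we can tell log_num_mtt is an mlx4_core parameter that simply does not exist for mlx5 devices, which would explain why only mlx5_* files are present. A minimal sketch of what we may try next, assuming the limit is managed through /etc/security/limits.conf on both nodes (the wildcard scope and values are our assumption):

# /etc/security/limits.conf -- allow unlimited pinned memory for RDMA registration
# (takes effect on the next login; any daemons that launch the MPI job must also be restarted)
*    soft    memlock    unlimited
*    hard    memlock    unlimited

After logging back in, ulimit -l should report "unlimited" before re-running osu_bibw across the two nodes.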
-------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud providers. While everything works on our own system, testing in the AWS cloud shows a problem.
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

With ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, but as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The attached script installs mvapich2 in a Docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Original Message----- From: Shineman, Nat Sent: Wednesday, 1 December 2021 19:28 To: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install", though, you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help will list all of the configuration options available to you. You are correct that --disable-fortran is one way to reduce the installation size. Likewise you can use --disable-cxx to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all", as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done.

Thanks, Nat

________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark

Hi MVAPICH-Team, we want to use containers for a cloud project. To minimize the container size and abilities, we would like to disable, for some containers, the benchmark tests and the compiling ability. Could you give us a recommendation on how to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understand, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud?

Sincerely Joachim Tscheuschner

_______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss

-------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL:

From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID:

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks, DK
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi, I am aware that there is the possibility to use the AWS version; however, we only provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and in this case the use of TCP on AWS would be enough (at the moment). However, the question (compare 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why

./configure --with-device=ch3:nemesis:ib,tcp

does not compile (note that nemesis:ib is deprecated):

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' .....

remains. Am I missing any flags or prerequisites?

Cheers
Joachim
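P.S. In case it helps while we investigate, one fallback we have been considering (our own idea, only a sketch, and the install prefixes below are made up) is to skip the unified ib,tcp binary and instead build two variants of the library inside the same container, selecting the matching one at launch time:

# build 1: default OFA-IB-CH3 device, for clouds that expose a supported IB/RDMA HCA
./configure --prefix=/opt/mvapich2/ib --with-device=ch3:mrail --with-rdma=gen2 --enable-g=none
make -j && make install

# build 2: plain TCP device, for clouds without a supported HCA (e.g. our AWS case)
# (run in a clean build directory)
./configure --prefix=/opt/mvapich2/tcp --with-device=ch3:nemesis:tcp --enable-g=none
make -j && make install

The container entrypoint could then point PATH and LD_LIBRARY_PATH at /opt/mvapich2/ib or /opt/mvapich2/tcp depending on whether an HCA is detected (for example via ibv_devinfo).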
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team,

I've been debugging an issue with our system that came up after installing extra ConnectX-5 cards in some of the servers (12 out of 48) in our system. When adding a single rank from a server hosting an extra CX-5, we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can again use all hosts and the maximum number of ranks.

The CX-5s were added a month ago, and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started showing up when the cables were connected between the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing "MV2_HOMOGENEOUS_CLUSTER=1" does not make any difference, and explicitly specifying "MV2_IBA_HCA=mlx5_0" does not help either.

Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. It is likely a configuration issue at my end, but I don't know what I'm missing here.
Unfortunately, I had to disable all cards for running tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6, and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs built for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari. 
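To make the DT_NEEDED observation above easy to check on a given install, the dynamic dependencies of the GDR library can be listed directly; the library path below is only a placeholder for wherever the MVAPICH2-GDR RPM was unpacked, not a path taken from this thread:

    # which CUDA shared libraries the MPI library was linked against
    readelf -d /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep NEEDED
    # and what the loader resolves them to on this node
    ldd /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep -i cuda

If only libcudart.so.11.0 and libcuda.so.1 show up, a newer CUDA 11.x runtime will usually satisfy the dynamic linker, but per the reply above the feature-level split between CUDA 11.2 and 11.3+ still makes the matching RPM the safer choice.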
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested in. We can build them for you. We are working on building the ones you'd requested as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
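One general observation on the vbuf failure reported above in this January 12 message: the "Cannot register vbuf region ... Cannot allocate memory (12)" error occurs while pinning memory for InfiniBand, and the ulimit -a output above shows "max locked memory" capped at 65536 kB (64 MB), which is a common cause of exactly this kind of registration failure. The log_num_mtt parameter mentioned above belongs to the mlx4_core driver and is not exposed by mlx5. A generic sketch of raising the locked-memory limit is given below; it is an illustration, not advice from the thread, and should be adjusted to site policy.

  # check the current limit on every node (64 MB is usually too low for RDMA jobs)
  ulimit -l
  # allow unlimited locked memory for all users, then log in again (or restart sshd)
  echo '* soft memlock unlimited' | sudo tee -a /etc/security/limits.conf
  echo '* hard memlock unlimited' | sudo tee -a /etc/security/limits.conf
  # verify from a fresh login
  ulimit -l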
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The attached document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites? Cheers Joachim
-----Original Message----- From: Shineman, Nat Sent: Wednesday, December 1, 2021 19:28 To: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install", though, you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory "/usr/libexec/osu-micro-benchmarks"; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running "./configure --help" will list all of the configuration options available to you. You are correct that "--disable-fortran" is one way to reduce the installation size. Likewise you can use "--disable-cxx" to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set "--enable-g=none" (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
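Putting the options discussed in this exchange together, a minimal-footprint, TCP-only build for a container could be configured roughly as below. This is a sketch based only on the flags mentioned in the thread; the install prefix and the parallel make width are placeholders.

  ./configure --prefix=/usr \
      --with-device=ch3:nemesis:tcp \
      --enable-g=none \
      --disable-fortran --disable-cxx
  make -j"$(nproc)" && make install
  # optional, per the workaround above: remove the installed OSU micro-benchmarks
  # to save space; this does not affect the library itself
  rm -rf /usr/libexec/osu-micro-benchmarks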
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65 at exch.dwd.de> Hi, I am aware that there is a possibility to use the AWS version; however, we just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and in this case the use of TCP for AWS would be enough (at the moment). However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note that nemesis:ib is deprecated): error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' ..... still remains. Am I missing any flags or prerequisites? Cheers Joachim
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com> Good day OSU team, I've been debugging an issue on our system that came up after installing extra ConnectX-5 cards in some of our servers (12 out of 48). When we add a single rank from a server hosting an extra CX-5, we start seeing the problem shown below. In this case, I disabled the PCIe slot hosting the extra CX-5 on every server except one and still get the error. If I remove the offending server from the "all_cards.cfg" host file, I can again use all hosts and the maximum number of ranks. The CX-5 cards were added a month ago, and I initially suspected I had made a mistake in the way I built the latest code, but I have tried multiple versions released to Rockport and still get into this state. Depending on the version used, I get different errors (which is strange). The problem started showing up when the cables were connected between the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing "MV2_HOMOGENEOUS_CLUSTER=1" does not make any difference, and explicitly specifying "MV2_IBA_HCA=mlx5_0" does not help either. Note: the interfaces were not configured at this point, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. It is likely a configuration issue at my end, but I don't know what I'm missing here.
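As a general aside on the HCA-selection aspect of this report: when a node exposes more than one Mellanox device, it can help to confirm which devices the library can actually see and to pin one explicitly. The commands below are generic verbs tools plus the MV2_IBA_HCA variable already used in this thread, shown only as an illustration:

  # list the RDMA devices and port state visible on each node
  ibv_devinfo -l
  ibstat
  # pin MVAPICH2 to a single HCA for the run (as already attempted in the command below)
  export MV2_IBA_HCA=mlx5_0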
Unfortunately, I had to disable all the cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6, and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL:
From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR build for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR build for OFED 5.x with OFED 5.4? Thanks, -Dan
From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs built against CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari. 
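Regarding the DT_NEEDED question above: which CUDA runtime a given MVAPICH2-GDR build was linked against can be read directly from the shared library with standard ELF tooling, for example (the library path below is a placeholder):

  # show the NEEDED entries (CUDA runtime, libcuda, verbs libraries, ...)
  readelf -d /path/to/mvapich2-gdr/lib64/libmpi.so | grep NEEDED
  # or check which libcudart is actually resolved at run time
  ldd /path/to/mvapich2-gdr/lib64/libmpi.so | grep -i cuda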
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR build for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR build for OFED 5.x with OFED 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss
From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested in. We can build them for you. We are working on building the ones you requested as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
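As a side note for anyone following the alien route suggested above: on Ubuntu the conversion can be sketched roughly as follows. This is only an illustration, not an officially supported recipe; the RPM filename is the one John links later in this thread, and the exact name of the .deb that alien emits may differ on your system.

  $ sudo apt-get install alien
  # convert the el8 RPM into a .deb, keeping the pre/post-install scripts
  $ sudo alien --scripts --to-deb mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
  # install whichever .deb alien generated
  $ sudo dpkg -i mvapich2-gdr-*.deb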
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
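For reference, once a CUDA-enabled build is installed, the device-to-device variant of this benchmark is normally launched along these lines (a sketch only; the hostfile and paths are placeholders, and MV2_USE_CUDA=1 is the same setting already being exported in this thread):

  # 'D D' places both the send and the receive buffer on the GPU
  $ MV2_USE_CUDA=1 mpirun -np 2 -hostfile hostfile ./osu_bibw D D

With mpirun_rsh the environment variable is instead passed as a NAME=VALUE argument before the executable.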
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
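The version strings listed just above, plus the MOFED version asked about next, can usually be read straight off the system. A rough sketch of the relevant commands; ofed_info ships with MLNX_OFED, the rest are standard tools:

  $ ofed_info -s                                                  # e.g. MLNX_OFED_LINUX-5.5-1.0.3.2
  $ gcc --version | head -n1
  $ nvcc --version | grep release                                 # CUDA toolkit version
  $ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # CUDA driver version
  $ uname -r                                                      # kernel
  $ lsb_release -ds                                               # distribution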
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
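A note on the vbuf failure quoted above: "Cannot register vbuf region ... Cannot allocate memory (12)" usually points at the locked-memory (memory registration) limit rather than at an mlx5 module parameter; log_num_mtt is a parameter of the older mlx4 driver, which is why no mlx4_* file exists on an mlx5-only system. The "max locked memory (kbytes, -l) 65536" in the ulimit output above (64 MB) is the usual suspect. A commonly suggested fix, offered here only as a sketch, is to raise memlock for the MPI user on every node and log in again:

  $ ulimit -l        # current locked-memory limit; 65536 KB in the output above

  # /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/)
  *    soft    memlock    unlimited
  *    hard    memlock    unlimited

If the ranks are launched through a resource manager or a systemd service, the equivalent limit there may need to be raised as well.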
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, but as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a Docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Ursprüngliche Nachricht-----
Von: Shineman, Nat
Gesendet: Mittwoch, 1. Dezember 2021 19:28
An: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari
Betreff: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run `make install', though, you should see all of the OMB binaries in a directory called `libexec' in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory `/usr/libexec/osu-micro-benchmarks'; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running `./configure --help' will list all of the configuration options available to you. You are correct that `--disable-fortran' is one way to reduce the installation size. Likewise you can use `--disable-cxx' to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set `--enable-g=none' (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
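Tying the pieces of this thread together for the container case: a build that combines the ch3:nemesis:tcp device reported to work on AWS with the size-reduction flags quoted above could be sketched as below. The install prefix is illustrative, and this produces a TCP-only library, so it does not answer the unified ib,tcp question raised here.

  $ ./configure --prefix=/opt/mvapich2-tcp \
                --with-device=ch3:nemesis:tcp \
                --enable-g=none --disable-fortran --disable-cxx
  $ make -j"$(nproc)" && make install
  # drop the installed OMB binaries, as suggested earlier in this thread
  $ rm -rf /opt/mvapich2-tcp/libexec/osu-micro-benchmarks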
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
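Given the portability constraint described above (one container image, fabric unknown until run time), one hedged approach is to ship more than one MPI installation in the image and select one at start-up based on what the node exposes. The probe below is only a sketch: the install paths are made up, and the checks are deliberately coarse.

  #!/bin/sh
  # Hypothetical layout: IB, EFA (MVAPICH2-X-AWS, recommended above) and TCP builds all present in the image.
  if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
      MPI_HOME=/opt/mvapich2-ib          # an IB/RoCE HCA is visible
  elif command -v fi_info >/dev/null 2>&1 && fi_info -p efa >/dev/null 2>&1; then
      MPI_HOME=/opt/mvapich2-x-aws       # AWS EFA adapter present
  else
      MPI_HOME=/opt/mvapich2-tcp         # fall back to the ch3:nemesis:tcp build
  fi
  export PATH="$MPI_HOME/bin:$PATH"
  export LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"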
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark

Hi MVAPICH-Team, we want to use containers for a cloud project. To minimize the container size & ability, we would like to disable, for some, the benchmark tests and the compiling ability. Could you give us a recommendation to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss

From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65 at exch.dwd.de>

Hi, I am aware that there is the possibility to use the AWS version; however, we only provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and for AWS the use of TCP would be enough (at the moment). However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note nemesis:ib is deprecated):

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' .....

remains. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Ursprüngliche Nachricht-----
Von: Mvapich-discuss Im Auftrag von mvapich-discuss-request at lists.osu.edu Gesendet: Freitag, 21. 
Januar 2022 15:45 An: mvapich-discuss at lists.osu.edu Betreff: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
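(Since the distinction between EFA adapters on AWS HPC instances and IB HCAs elsewhere comes up above, here is a quick way to check which fabric a given cloud instance actually exposes. This is only a sketch; it assumes fi_info from libfabric and ibv_devinfo from rdma-core are installed, which the thread does not state.)

    fi_info -p efa     # lists EFA endpoints on AWS HPC instances, reports nothing/errors otherwise
    ibv_devinfo        # lists InfiniBand/RoCE HCAs visible to the verbs stack, if any
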
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. ------------------------------ Subject: Digest Footer _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu ------------------------------ End of Mvapich-discuss Digest, Vol 14, Issue 8 ********************************************** From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com> Good day OSU team, I've been debugging an issue with our system that just came up after installing new extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5 we start seeing problems as below: In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can now use all hosts and the maximum number of ranks. The CX-5s were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started showing up when the cables were added between the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing "MV2_HOMOGENEOUS_CLUSTER=1" does not make any difference, and explicitly specifying "MV2_IBA_HCA=mlx5_0" doesn't help either. Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end but I don't know what I'm missing here. 
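(A sketch of one way to spot which hosts expose the extra, unconfigured CX-5 ports before launching the job. None of this comes from the thread itself: it assumes one hostname per line in all_cards.cfg, passwordless ssh to each node, and rdma-core installed everywhere.)

    # Inventory HCAs and port states on every host listed in the MPI hostfile.
    for h in $(cut -d: -f1 all_cards.cfg | sort -u); do
        echo "== $h =="
        ssh "$h" 'ibv_devinfo | grep -E "hca_id|port:|state|link_layer"'
    done

Hosts that report a second mlx5 device with ports in PORT_DOWN or PORT_INIT are the ones picking up the new card; MV2_IBA_HCA, as used in the command below, can then be checked against the intended device on those nodes.
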
Unfortunately, I had to disable all cards for running the tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL: From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR built for OFED 5.x with OFED 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari. 
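(Related to the DT_NEEDED observation above, a quick way to confirm which CUDA runtime a given MVAPICH2-GDR build was actually linked against. The library path below is illustrative and depends on where the RPM installs; only the general idea, not the path, is implied by the thread.)

    MV2_LIB=/opt/mvapich2/gdr/2.3.6/lib64/libmpi.so   # adjust to your install layout
    readelf -d "$MV2_LIB" | grep NEEDED               # DT_NEEDED entries (libcudart.so.*, libcuda.so.*, ...)
    ldd "$MV2_LIB" | grep -i cuda                     # which CUDA libraries resolve at load time
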
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
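(For readers following along, the device-to-device case described above is typically exercised as sketched here once a CUDA-aware build is in place. The host names and install prefix are placeholders; only MV2_USE_CUDA=1 and the --enable-cuda build are taken from the thread itself.)

    MV2=/opt/mvapich2    # illustrative install prefix
    $MV2/bin/mpirun_rsh -np 2 host1 host2 MV2_USE_CUDA=1 \
        $MV2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw D D   # D D = device buffers on both ranks
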
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
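(A sketch of the alien conversion route Nat describes above for Ubuntu. The RPM filename is the one referenced elsewhere in this thread; the exact .deb name that alien produces will vary, hence the wildcard.)

    sudo apt-get install alien
    sudo alien --scripts --to-deb mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    sudo dpkg -i mvapich2-gdr-*.deb
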
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
>
> Thanks,
> John
>
> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote:
>> Hi John,
>>
>> You should be able to use the RPMs on Ubuntu by converting them with alien.
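(A minimal sketch of that alien-based conversion, assuming the GDR RPM named earlier in the thread and a stock Ubuntu 20.04 host; the exact generated .deb name and required flags may differ.)

sudo apt-get install alien
# convert the downloaded GDR RPM to a .deb, keeping its install scripts
sudo alien --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
# install whatever .deb alien produced (name varies with the release number)
sudo dpkg -i mvapich2-gdr-*.deb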
From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

John,

Thanks, we will get started on generating this RPM shortly.

Nat
From john at flexcompute.com Wed Jan 12 11:21:05 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:21:05 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

Great, thank you.

On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote:
> John,
>
> Thanks, we will get started on generating this RPM shortly.
>
> Nat
From john at flexcompute.com Wed Jan 12 15:46:07 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 15:46:07 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

Hello,

While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node, without errors. However, when we try to run across two nodes, we get the following error:

john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of log_num_mtt for OFED. We found documentation on how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* files under /etc/modprobe.d, only mlx5_*. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output of ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
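(The "Cannot register vbuf region" / "Cannot allocate memory (12)" pattern together with the 64 MB "max locked memory" value above usually points at the locked-memory limit rather than at log_num_mtt, which is an mlx4-era module parameter. A minimal sketch of raising the limit on both nodes follows; the drop-in file name is an assumption, and whether this is the actual cause should be verified against the MVAPICH2 user guide for your distribution.)

# on every node, allow MPI processes to register (pin) enough memory
cat <<'EOF' | sudo tee /etc/security/limits.d/rdma.conf
*   soft   memlock   unlimited
*   hard   memlock   unlimited
EOF
# start a fresh login session (including the SSH sessions used by mpirun), then verify:
ulimit -l    # should report 'unlimited' instead of 65536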
From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Fri, 21 Jan 2022 14:27:51 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de>
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>

Hello Nat,

we are building a portable container for different cloud providers. While everything works on our own system, testing in the AWS cloud shows a problem.
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

With ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose IB support for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The document contains the script to install mvapich2 in a docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, 1 December 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed.

Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run 'make install', you should see all of the OMB binaries in a directory called 'libexec' in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running './configure --help' will list all of the configuration options available to you. You are correct that '--disable-fortran' is one way to reduce the installation size. Likewise you can use '--disable-cxx' to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set '--enable-g=none' (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done.

Thanks,
Nat
________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark

Hi MVAPICH-Team,

we want to use containers for a cloud project. To minimize the container size, we would like to disable, for some containers, the benchmark tests and the compiling ability. Could you give us a recommendation on how to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud?

Sincerely
Joachim Tscheuschner

From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 21 Jan 2022 14:44:53 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
Message-ID:

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks,
DK
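(Putting the size-reduction advice above together, a minimal container-oriented build might look like the sketch below. The flags are the ones named in this thread; the install prefix and the idea of pruning the OMB directory afterwards are assumptions to adapt to your image layout.)

./configure --prefix=/opt/mvapich2 --enable-g=none --disable-fortran --disable-cxx
make -j$(nproc) && make install
# the OMB binaries are not needed inside a minimal image; removing them does not affect the library
rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks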
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi,

I am aware that there is the possibility to use the AWS version; however, we just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and in this case the use of tcp for AWS would be enough (at the moment). However, the question (compare to 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why

./configure --with-device=ch3:nemesis:ib,tcp

does not compile (note nemesis:ib is deprecated):

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

remains. Am I missing any flags or prerequisites?

Cheers
Joachim
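(One possible workaround for the portability concern, sketched under the assumption that shipping two builds inside the image is acceptable: install the default OFA-IB-CH3 build and a ch3:nemesis:tcp build side by side and select one at container start-up. This is only an illustration, not a recommendation made in the thread.)

# inside the image: two installs from the same source tree
./configure --prefix=/opt/mvapich2-ib --enable-g=none && make -j$(nproc) install
make distclean
./configure --prefix=/opt/mvapich2-tcp --with-device=ch3:nemesis:tcp --enable-g=none && make -j$(nproc) install

# at container start-up: prefer the IB build only when an HCA is visible
if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
    export PATH=/opt/mvapich2-ib/bin:$PATH
else
    export PATH=/opt/mvapich2-tcp/bin:$PATH
fi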
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team,

I've been debugging an issue with our system that came up after installing extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5, we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and still get the error. If I remove the offending server from the "all_cards.cfg" host file, I can again use all hosts and the maximum number of ranks.

The CX-5 cards were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started showing up when the cables were added between the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing MV2_HOMOGENEOUS_CLUSTER=1 does not make any difference, and explicitly specifying MV2_IBA_HCA=mlx5_0 doesn't help either.

Note: the new interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here.
Unfortunately, I had to disable all the cards to run the tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.

/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100
ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1

Regards,
Nicolas Gagnon
Principal Designer/Architect, Engineering
ngagnon at rockportnetworks.com
Rockport | Simplify the Network

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022
From: daniel.pou at hpe.com (Pou, Dan)
Date: Fri, 28 Jan 2022 18:20:26 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4?

Thanks,
-Dan

From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022
From: subramoni.1 at osu.edu (Subramoni, Hari)
Date: Fri, 28 Jan 2022 18:42:17 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
Message-ID:

Hi, Dan.

We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs built against CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4.

Best,
Hari.
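A quick way to check the DT_NEEDED/CUDA-compatibility point discussed above is to inspect the installed GDR library directly. The prefix below is only a placeholder; the actual path and lib64 vs. lib layout depend on which RPM was installed:

  # MV2_GDR_HOME is a placeholder for wherever the MVAPICH2-GDR RPM was installed.
  MV2_GDR_HOME=/opt/mvapich2-gdr
  readelf -d "$MV2_GDR_HOME"/lib64/libmpi.so | grep NEEDED   # should list the libcudart.so.11.0 / libcuda.so.1 entries mentioned above
  ldd "$MV2_GDR_HOME"/lib64/libmpi.so | grep -i cuda         # shows which CUDA runtime on the node actually resolves those entries
  nvcc --version                                             # toolkit version on the node, to compare against the RPM's cudaXX.Y tag

Comparing the resolved libcudart against the cudaXX.Y tag in the RPM name is a simple way to spot the 11.2-vs-11.3+ split described in the reply.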
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
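A rough sketch of the alien-based conversion suggested above, using the RPM file name from the download page as an example (exact file names will differ for other builds):

  # Convert the MVAPICH2-GDR RPM to a .deb and install it on Ubuntu (run with root privileges).
  # --scripts carries over any install-time scripts contained in the RPM.
  apt-get install alien
  alien --scripts --to-deb mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
  dpkg -i mvapich2-gdr*.deb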
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
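As a concrete illustration of the alien route quoted just above, a hedged sketch for Ubuntu follows; the RPM filename is the one already mentioned in this thread, the resulting .deb name is whatever alien generates, and (per the advice above about exact CUDA matching) the cuda11.3 build shown here would still need to be swapped for one matching the CUDA 11.4 / MOFED stack on the actual nodes:

# Convert the MVAPICH2-GDR RPM to a .deb and install it on Ubuntu 20.04
sudo apt-get install alien
sudo alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
# alien chooses the output name itself; install whichever .deb it produced
sudo dpkg -i mvapich2-gdr-*.deb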
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
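Coming back to the "Cannot register vbuf region ... Cannot allocate memory (12)" failure reported earlier in this message: that error is raised when the library cannot register (pin) memory with the HCA, and the ulimit output above shows "max locked memory" capped at 65536 kB, which is a common cause. The log_num_mtt knob mentioned above only exists for the mlx4 driver; with an mlx5 / MLNX_OFED 5.x stack the usual first step is simply to raise the locked-memory limit. A minimal sketch, assuming bash, root access, and that limits are applied through /etc/security/limits.d (the file name and the choice of "unlimited" are assumptions, not taken from the thread):

# Check the limit the launched ranks actually inherit (it can differ from an interactive shell)
mpirun -np 2 -hostfile hostfile bash -c 'ulimit -l'

# Allow unlimited locked memory on every node, then log in again / restart the launcher
cat <<'EOF' | sudo tee /etc/security/limits.d/rdma.conf
* soft memlock unlimited
* hard memlock unlimited
EOF

# Retry once both nodes report "unlimited"
MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw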
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
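As a rough sketch of the workaround described above (the install prefix is an assumption; the flags are the ones named in the reply):

# Slim build: no debug symbols, no Fortran or C++ bindings
./configure --prefix=/opt/mvapich2-2.3.6 --enable-g=none --disable-fortran --disable-cxx
make -j"$(nproc)"
make install
# Pre-2.3.7 there is no configure flag to skip the OSU micro-benchmarks,
# so delete the installed benchmark binaries (roughly 3MB) afterwards
rm -rf /opt/mvapich2-2.3.6/libexec/osu-micro-benchmarks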
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
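A quick, hedged way of checking which of these two situations a given cloud node is in (ibv_devices ships with rdma-core and fi_info with libfabric; neither is guaranteed to be present inside the container image):

# List verbs-capable HCAs; an empty list matches the "No IB device found" failure quoted in this thread
ibv_devices
# On AWS HPC instances the adapter is an EFA device exposed through libfabric instead
fi_info -p efa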
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de> Hi, I am aware that there is the option of using the AWS-specific version; however, we only provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and for now using tcp on AWS would be enough. However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note that nemesis:ib is deprecated): error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' ..... remains. Am I missing any flags or prerequisites? Cheers Joachim -----Ursprüngliche Nachricht----- Von: Mvapich-discuss Im Auftrag von mvapich-discuss-request at lists.osu.edu Gesendet: Freitag, 21.
Januar 2022 15:45 An: mvapich-discuss at lists.osu.edu Betreff: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com> Good day OSU team, I've been debugging an issue with our system that came up after installing extra ConnectX-5 cards in some servers (12 out of 48) in our system. When adding a single rank from a server hosting an extra CX-5 we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on every server except one and am still getting the error. If I remove the offending server from the 'all_cards.cfg' host file, I can use all hosts and the maximum number of ranks. The CX-5 cards were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get a different error (which is strange). The problem started showing up when the cables were connected between the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing 'MV2_HOMOGENEOUS_CLUSTER=1' does not make any difference, and explicitly specifying 'MV2_IBA_HCA=mlx5_0' doesn't help either. Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. It is likely a configuration issue at my end, but I don't know what I'm missing here.
Unfortunately, I had to disable all cards for running test but if required I can reconfigure some of the server to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6, and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network [signature_849490256] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL: From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari. 
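For anyone checking the same thing locally, the DT_NEEDED entries Dan mentions can be inspected directly; a small sketch (the library path is an assumption and will differ between installs):

# Show which CUDA sonames the GDR library was linked against (e.g. libcudart.so.11.0, libcuda.so.1)
readelf -d /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep NEEDED
# Confirm the loader can resolve them against the CUDA actually installed on the node
ldd /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep -iE 'cuda|not found'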
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
From shineman.5 at osu.edu  Wed Jan 12 09:26:06 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 14:26:06 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi John,

You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure the CUDA version is an exact match, but the compiler only needs to match the same major version. You will also want to match the MOFED major version, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications.

Thanks,
Nat
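As a rough, unofficial illustration of the alien-based conversion Nat suggests: the RPM filename below is the one John links in his next message, and the final install paths depend on how that RPM was built, so treat this as a sketch rather than a recipe.

    # convert the MVAPICH2-GDR RPM to a .deb and install it on Ubuntu
    sudo apt-get install -y alien
    sudo alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    dpkg -c mvapich2-gdr-*.deb      # inspect where the files will land before installing
    sudo dpkg -i mvapich2-gdr-*.deb
    # then add the installed bin/ and lib/ directories to PATH and LD_LIBRARY_PATH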
From john at flexcompute.com  Wed Jan 12 11:14:15 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:14:15 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi Nat,

We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM:
http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm

If you could build us a custom RPM for our system, that would be very helpful. We're running:

Ubuntu 20.04, kernel 5.4.0-92-generic
GCC version: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CUDA version: CUDA 11.4
CUDA driver: 470.82.01

Please let me know if there is any other information that you need.

Thanks,
John
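For completeness, the details above can be gathered on each node with standard utilities; nothing here is MVAPICH-specific, and nvcc is only present if the CUDA toolkit is on the PATH.

    # OS, kernel, and compiler
    lsb_release -ds
    uname -r
    gcc --version | head -n 1
    # CUDA toolkit and driver (nvcc reports the toolkit, nvidia-smi the driver)
    nvcc --version | tail -n 1
    nvidia-smi --query-gpu=driver_version --format=csv,noheader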
From shineman.5 at osu.edu  Wed Jan 12 11:16:45 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:16:45 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi John,

Can you tell us the OFED version on your system?

Thanks,
Nat
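On a node with the Mellanox stack installed, this can be answered directly; ofed_info ships with MLNX_OFED, and the short form prints just the version string (the example output mirrors the version John reports below).

    ofed_info -s          # e.g. MLNX_OFED_LINUX-5.5-1.0.3.2
    ibv_devinfo | head    # confirms which HCA and firmware the node sees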
From john at flexcompute.com  Wed Jan 12 11:19:08 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:19:08 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi Nat,

We are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64

Thanks,
John
From shineman.5 at osu.edu  Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

John,

Thanks, we will get started on generating this RPM shortly.

Nat
From john at flexcompute.com  Wed Jan 12 11:21:05 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:21:05 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Great, thank you.

John
From john at flexcompute.com  Wed Jan 12 15:46:07 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 15:46:07 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hello,

While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node, without errors. However, when we try to run across two nodes, we get the following error:

    john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
    [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
    [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
    [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
    [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
    [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
    [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* files under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output of ulimit -a on both nodes is:

    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 4126989
    max locked memory       (kbytes, -l) 65536
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 4126989
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
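One common culprit for "Cannot register vbuf region ... Cannot allocate memory" on an InfiniBand system is the locked-memory ulimit, which caps how much memory the IB stack may pin for registration; the 65536 kB (64 MB) limit shown above is quite small. This is general RDMA troubleshooting rather than a diagnosis confirmed in this thread, and note that log_num_mtt belongs to the older mlx4_core driver, so it generally does not apply on an mlx5-only MOFED 5.x stack. A sketch of checking and raising the limit on both nodes; the limits.d file name is arbitrary:

    # check the limit in the shell that actually launches MPI jobs
    ulimit -l
    # raise it for all users on both nodes, then log in again
    echo '* soft memlock unlimited' | sudo tee    /etc/security/limits.d/99-rdma.conf
    echo '* hard memlock unlimited' | sudo tee -a /etc/security/limits.d/99-rdma.conf
    # if ranks are started through sshd or a batch daemon, restart that service
    # so remote processes inherit the new limit
    ulimit -l    # a fresh login shell should now report "unlimited"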
From Joachim.Tscheuschner at dwd.de  Fri Jan 21 09:27:51 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Fri, 21 Jan 2022 14:27:51 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark
In-Reply-To:
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de>
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>

Hello Nat,

We are building a portable container for different cloud providers. While everything works on our system, testing on the AWS cloud shows a problem.
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work on AWS:

    [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
    [cli_25]: aborting job:
    Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(493)............:
    MPID_Init(419)...................: channel initialization failed
    MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
    rdma_get_control_parameters(1926): rdma_open_hca
    rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

    singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

    CPU Affinity is undefined
    [cli_1]: aborting job:
    Fatal error in MPI_Init: Other MPI error, error stack:
    MPIR_Init_thread(493)........:
    MPID_Init(400)...............:
    MPIDI_CH3I_set_affinity(3474):
    smpi_setaffinity(2719).......: CPU Affinity is undefined.

With ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

    error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The attached document contains the script to install mvapich2 in a docker container with Debian:bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers,
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, December 1, 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Installation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container; just ensure that you have GNU Autotools installed.

Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install", you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running "./configure --help" will list all of the configuration options available to you. You are correct that --disable-fortran is one way to reduce the installation size. Likewise you can use --disable-cxx to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all", as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.

Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done.

Thanks,
Nat

________________________________

From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark

Hi MVAPICH-Team,

we want to use containers for a cloud project. To minimize the container size & ability, we would like to disable, for some, the benchmark tests and the compiling ability. Could you give us a recommendation on how to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud?

Sincerely,
Joachim Tscheuschner

-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
Desc: install.mvapich.sh
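To make the size-trimming advice in the quoted reply above concrete, here is a rough, unofficial sketch of a minimal build; the flags are the ones named in that reply, while the install prefix and the -j value are just placeholders:

    # configure a small MVAPICH2 build (C bindings only, no debug symbols)
    ./configure --prefix=/opt/mvapich2 --enable-g=none --disable-fortran --disable-cxx
    make -j 8 && make install
    # workaround until a configure flag exists: drop the installed benchmark binaries
    rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks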
From panda at cse.ohio-state.edu  Fri Jan 21 09:44:53 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 21 Jan 2022 14:44:53 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark
In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
Message-ID:

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named MVAPICH2-X-AWS. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks,
DK
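Since the same container image is meant to run on more than one cloud, it may help to probe the available fabric at startup before deciding which MPI build or device to use. A hedged sketch, assuming the rdma-core and libfabric utilities are installed in the image:

    # list RDMA-capable (verbs) devices; empty on a TCP-only setup
    ibv_devinfo 2>/dev/null | grep -m1 hca_id || echo "no verbs device found"
    # list libfabric providers; the 'efa' provider appears on AWS HPC instances
    fi_info -l 2>/dev/null || echo "libfabric utilities not installed"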
From Joachim.Tscheuschner at dwd.de  Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Installation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi,

I am aware that there is the possibility to use the AWS version; however, we only provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and in this case the use of TCP for AWS would be enough (at the moment). However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) remains: why does

    ./configure --with-device=ch3:nemesis:ib,tcp

not compile (note that nemesis:ib is deprecated)?

    error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'
    .....

Am I missing any flags or prerequisites?

Cheers,
Joachim
January 2022 15:45 To: mvapich-discuss at lists.osu.edu Subject: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://lists.osu.edu/mailman/listinfo/mvapich-discuss or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud providers. While everything works on our system, testing with the AWS cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The attached document contains the script to install mvapich2 in a Docker container with Debian 10 or 11 (bullseye). Am I missing any flags or prerequisites? Cheers Joachim -----Original Message----- From: Shineman, Nat Sent: Wednesday, 1 December 2021 19:28 To: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
------------------------------ Subject: Digest Footer _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ------------------------------ End of Mvapich-discuss Digest, Vol 14, Issue 8 ********************************************** From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com> Good day OSU team, I've been debugging an issue with our system that came up after installing new extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5 we start seeing problems as below: In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the 'all_cards.cfg' host file, I can again use all hosts and the maximum number of ranks. The CX-5s were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used I get different errors (which is strange). The problem started exhibiting itself when the cables were added between the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing 'MV2_HOMOGENEOUS_CLUSTER=1' does not make any difference, and explicitly specifying 'MV2_IBA_HCA=mlx5_0' doesn't help either. Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here. 
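One way to narrow a multi-HCA problem like this down is to check what the verbs layer reports on each node and then pin MVAPICH2 to a single adapter for a small run. This is only a sketch, not from the original thread: 'two_hosts.cfg' is a hypothetical host file, mlx5_0 is the device name mentioned above, and MV2_NUM_HCAS is an MVAPICH2 run-time parameter used here purely for illustration:

  ibv_devinfo | grep -E 'hca_id|state'   # list the HCAs and port states each node exposes
  ibstat mlx5_0                          # confirm the intended adapter is Active / LinkUp
  # Pin the job to one known-good adapter for a 2-rank smoke test:
  mpiexec -np 2 -f two_hosts.cfg -env MV2_IBA_HCA=mlx5_0 -env MV2_NUM_HCAS=1 \
      ./osu_alltoall -f -i 100
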
Unfortunately, I had to disable all cards for running tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network [signature_849490256] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL: From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open-source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR built for OFED 5.x with OFED 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari. 
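As a quick way to see which CUDA runtime a given MVAPICH2-GDR installation was built against (relevant to the DT_NEEDED entries Dan mentions and to the 11.2-vs-11.3 compatibility note above), one can inspect the shared library directly. The install path below is an assumption, not something taken from this thread:

  # Show the CUDA libraries the GDR build depends on
  ldd /opt/mvapich2-gdr/lib64/libmpi.so | grep -Ei 'libcudart|libcuda'
  # Expect entries such as libcudart.so.11.0 and libcuda.so.1; per the note above,
  # an RPM built for CUDA 11.2 or lower should not be mixed with CUDA 11.3 or higher.
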
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
>
> Thanks,
> John
>
> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote:
>
>> Hi John,
>>
>> You should be able to use the RPMs on Ubuntu by converting them with
>> alien.
From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

John,

Thanks, we will get started on generating this RPM shortly.

Nat
From john at flexcompute.com Wed Jan 12 11:21:05 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:21:05 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

Great, thank you.

On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote:

> John,
>
> Thanks, we will get started on generating this RPM shortly.
>
> Nat
From john at flexcompute.com Wed Jan 12 15:46:07 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 15:46:07 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: 
References: 
Message-ID: 

Hello,

While we wait for the RPM, we are trying to get regular MVAPICH2 to work
across two of our nodes. We compiled version 2.3.6 from source. We can run
the osu_bibw test locally, within a node, without errors. However, when we
try to run across two nodes, we get the following error:

john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation suggesting this may be due to the value of
log_num_mtt for OFED. The documentation describes changing that parameter in
/etc/modprobe.d/mlx4_en.conf, but we do not have any mlx4_* files under
/etc/modprobe.d, only mlx5_*. We are using MLNX_OFED_LINUX-5.5-1.0.3.2, as
mentioned above.

The output of ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
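A common cause of "Cannot register vbuf region" / "Cannot allocate memory (12)" is a low locked-memory limit, and the ulimit output above shows max locked memory capped at 65536 kB. A sketch of raising it, assuming the usual pam_limits setup (check the MVAPICH2 user guide for the recommended settings on your distribution):

    # /etc/security/limits.conf on every compute node
    * soft memlock unlimited
    * hard memlock unlimited

    # then open a fresh login session on each node and verify
    ulimit -l     # should now report: unlimited

The new limit also has to apply to the remotely launched MPI processes (for example the shells started by mpirun over ssh), not just the interactive login.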
From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Fri, 21 Jan 2022 14:27:51 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To: 
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de>
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>

Hello Nat,

we are building a portable container for different cloud providers. While
everything works on our own system, testing on the AWS cloud shows a
problem.
If we compile mvapich2-2.3.6 with

    ./configure --enable-g=none

the communication does not work on AWS:

    [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
    [cli_25]: aborting job:
    Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(493)............:
    MPID_Init(419)...................: channel initialization failed
    MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
    rdma_get_control_parameters(1926): rdma_open_hca
    rdma_open_hca(1046)..............: No IB device found

With ./configure --enable-g=all and

    singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

    CPU Affinity is undefined
    [cli_1]: aborting job:
    Fatal error in MPI_Init: Other MPI error, error stack:
    MPIR_Init_thread(493)........:
    MPID_Init(400)...............:
    MPIDI_CH3I_set_affinity(3474):
    smpi_setaffinity(2719).......: CPU Affinity is undefined.

With ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT
as a portable container I will lose IB for other cloud systems. And, as
mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp
does not compile:

    error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The attached document contains the script to install mvapich2 in a Docker
container with Debian bullseye 10 or 11. Am I missing any flags or
prerequisites?

Cheers,
Joachim
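As a possible interim step until the unified ib,tcp device builds, one option is to ship two separate builds in the same container image and select one at run time; the configure lines below are only a sketch based on the flags already used in this thread, and the install prefixes are arbitrary. A wrapper that picks between the two prefixes is sketched further down in the thread.

    # Build 1: default OFA-IB-CH3 device, for systems with a supported HCA
    ./configure --prefix=/opt/mvapich2-ib --enable-g=none
    make -j && make install

    # Build 2: TCP/IP nemesis device, for systems where no IB device is found
    # (as on the AWS instances above; run from a freshly extracted source tree)
    ./configure --prefix=/opt/mvapich2-tcp --with-device=ch3:nemesis:tcp --enable-g=none
    make -j && make install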
-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, 1 December 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6
for your use case. mv2-virt is very old and has not been updated for some
time. Using a debian based container should be fine. Everything will compile
normally from within the container, just ensure that you have GNU Autotools
installed.

Unfortunately, at this time we do not have a configure-time flag for
disabling the OMB suite. We will look into adding support for this in our
upcoming 2.3.7 release. For the time being we can suggest a workaround. When
you run `make install` you should see all of the OMB binaries in a directory
called `libexec` in your mvapich2 installation directory. If you are
installing it in the default location (/usr) you will see a directory
/usr/libexec/osu-micro-benchmarks; in my experience it is typically around
3MB. You can delete this directory to remove the installed binaries from
your system. This will have no impact on the rest of the library's
functionality.

Regarding the other disable options: running `./configure --help` will list
all of the configuration options available to you. You are correct that
--disable-fortran is one way to reduce the installation size. Likewise you
can use --disable-cxx to disable C++ bindings if you only wish to have C
libraries installed. However, there is no option for "disable all", as many
of the enable/disable flags are used to determine feature sets within the
library and not actual binaries. Please ensure that you set --enable-g=none
(which should be the default) to remove all debugging symbols and reduce
size. Other than that, just avoiding enabling additional features should
yield the smallest libraries possible.

Please let us know if you have any trouble with the installation or if the
compiled libraries are still too large and we can see what can be done.

Thanks,
Nat

________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark

Hi MVAPICH-Team,

we want to use containers for a cloud project. To minimize the container
size & ability, we would like to disable, for some, the benchmark test and
compiling ability. Could you give us a recommendation on how to do so? We
would like to use a Debian version (instead of CentOS) in the container. As
far as I understood, with <--disable-x> the compiling ability can be
reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable
all? What is the option to disable the osu-benchmarks? Would you recommend
the package or for any cloud?

Sincerely,
Joachim Tscheuschner

-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
Desc: install.mvapich.sh
URL: 

From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 21 Jan 2022 14:44:53 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de>
 <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
Message-ID: 

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a
separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is
optimized for the AWS EFA adapters. Please use this version (you can
download it from the MVAPICH2 download site) and follow the steps mentioned
in the associated user guide. Let us know if you encounter any issues with
this version.

Thanks,
DK
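Pulling the size-reduction suggestions from Nat's December reply (quoted above) into one sequence, as a sketch only, with an arbitrary install prefix:

    ./configure --prefix=/opt/mvapich2-min --enable-g=none --disable-fortran --disable-cxx
    make -j && make install
    # The OMB suite cannot yet be disabled at configure time; remove the installed binaries afterwards
    rm -rf /opt/mvapich2-min/libexec/osu-micro-benchmarks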
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi,

I am aware that there is a possibility to use the AWS version, but we only
provide the container and the customers choose the cloud. To minimize the
effort we want to use the normal version, and in that case using TCP on AWS
would be enough (at the moment). However, the question remains (compare
section 4.13, "Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis
(unified binary)", in the user guide): why does

    ./configure --with-device=ch3:nemesis:ib,tcp

not compile (noting that nemesis:ib is deprecated)?

    error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

Am I missing any flags or prerequisites?

Cheers,
Joachim
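Until the unified ib,tcp build compiles, one way to keep a single portable container is to install both the IB and the TCP builds (as sketched earlier in the thread) and choose between them when the container starts. This is only an illustrative wrapper, not an official mechanism; the prefixes match the earlier sketch:

    #!/bin/sh
    # entrypoint.sh -- pick an MVAPICH2 install based on the hardware that is visible
    if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
        MV2_HOME=/opt/mvapich2-ib       # at least one RDMA HCA is present
    else
        MV2_HOME=/opt/mvapich2-tcp      # no HCA visible: fall back to the TCP/IP nemesis build
    fi
    export PATH="$MV2_HOME/bin:$PATH"
    export LD_LIBRARY_PATH="$MV2_HOME/lib:$LD_LIBRARY_PATH"
    exec "$@"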
From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team,

I've been debugging an issue with our system that came up after installing
extra ConnectX-5 cards in some servers (12 out of 48) in our system. When we
add a single rank from a server hosting an extra CX-5, we start seeing the
problem shown below. In this case, I disabled the PCIe slot hosting the
extra CX-5 on every server except one and am still getting the error. If I
remove the offending server from the "all_cards.cfg" host file, I can again
use all hosts and the maximum number of ranks.

The CX-5s were added a month ago, and I initially suspected I made a mistake
in the way I built the latest code, but I have tried multiple versions
released to Rockport and still get into this state. Depending on the version
used I get a different error (which is strange). The problem started
appearing when the cables were connected between the CX-5s and the switches.
Unless I disable the PCIe slot hosting the extra card I cannot run the
simple test below. Removing "MV2_HOMOGENEOUS_CLUSTER=1" does not make any
difference, and explicitly specifying "MV2_IBA_HCA=mlx5_0" doesn't help
either.

Note: the new interfaces were not configured at this point, and I have not
used the cards at all. They are CX-5 VPI cards and they are still configured
for IB. I couldn't find any information in the User's Guide related to the
problem I'm seeing. It is likely a configuration issue at my end, but I
don't know what I'm missing here.
Unfortunately, I had to disable all the extra cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.
/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100
ssh: connect to host 172.20.141.148 port 22: No route to host
^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
[mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
[mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
[mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
[user at dell-s13-h1
Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network
From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4? Thanks, -Dan
From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID:
Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs built for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
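As a small aside on the DT_NEEDED observation above, one generic way to check which CUDA runtime a given MVAPICH2-GDR build expects is to inspect the dynamic section of its MPI library. This is only a sketch; the library path below is an assumed example, not one confirmed in the thread:

  # print the shared-library dependencies recorded in the MPI library
  # (replace the path with the actual install location of libmpi.so)
  readelf -d /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep NEEDED
  # or resolve them against the libraries currently installed on the system
  ldd /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep -i cuda

An entry such as libcudart.so.11.0 in that output is the dependency Dan refers to above.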
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari.
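For anyone trying to select a matching MVAPICH2-GDR package, a quick way to see what a given RPM was built against before installing it is to query its metadata. The file name below is only a placeholder following the naming pattern used on the download page, not a specific package from this thread:

  # show the package metadata; the MVAPICH2-GDR package name itself encodes
  # the CUDA, MOFED and compiler combination it was built for
  rpm -qp --info mvapich2-gdr-cudaX.Y.mofedA.B.gnuN.N.N-2.3.6-1.el8.x86_64.rpm
  # list the run-time dependencies the package declares
  rpm -qp --requires mvapich2-gdr-cudaX.Y.mofedA.B.gnuN.N.N-2.3.6-1.el8.x86_64.rpm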
From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code?
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
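One common cause of the "Cannot register vbuf region ... Cannot allocate memory (12)" failure above is a low locked-memory limit rather than log_num_mtt: log_num_mtt is an mlx4_core module parameter, which is presumably why only mlx5_* files exist under /etc/modprobe.d on these nodes, and the ulimit output shows max locked memory at 65536 kB (64 MB), usually far too small for InfiniBand memory registration. A minimal sketch of raising it; the limits.conf and systemd details are general Linux practice, not something specified in this thread:

# Check what the MPI processes actually see on both nodes
ulimit -l
# Raise it system-wide in /etc/security/limits.conf on both nodes, then log in again:
#   * soft memlock unlimited
#   * hard memlock unlimited
# If jobs are launched through a daemon (slurmd, a systemd service, ...), raise it there as well,
# e.g. LimitMEMLOCK=infinity in the unit file.
# Once "ulimit -l" reports unlimited on both hosts, re-run the two-node test; the extra variable
# only addresses the job-startup warning quoted earlier:
MV2_SMP_USE_CMA=0 MV2_HOMOGENEOUS_CLUSTER=1 mpirun -np 2 -hostfile hostfile ./osu_bibw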
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
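For the Spack route mentioned in the quoted reply, a minimal sketch is shown below; the exact versions and variants exposed depend on the Spack release and its mvapich2-gdr recipe, so it is worth checking the package info first.

# See which versions/variants the recipe currently offers
spack info mvapich2-gdr
# Install and load it; Spack resolves a matching CUDA as a dependency
spack install mvapich2-gdr
spack load mvapich2-gdr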
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work (AWS): [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw we get: CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And with ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The attached document contains the script to install mvapich2 in a Docker container with Debian 10 or 11 (bullseye). Am I missing any flags or prerequisites? Cheers Joachim -----Original Message----- From: Shineman, Nat Sent: Wednesday, 1 December 2021 19:28 To: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install, though, you should see all of the OMB binaries in a directory called libexec in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help will list all of the configuration options available to you. You are correct that --disable-fortran is one way to reduce the installation size. Likewise you can use --disable-cxx to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
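Putting the advice quoted above together, a slimmed container build might look like the following sketch; the --prefix choice and the MV2_ENABLE_AFFINITY=0 workaround for the "CPU Affinity is undefined" failure are assumptions on my part (MV2_ENABLE_AFFINITY is a standard MVAPICH2 knob, but it is not something suggested in this thread).

# Slim build for a container image, per the workaround described above
./configure --prefix=/usr --enable-g=none --disable-fortran --disable-cxx
make -j"$(nproc)" && make install
rm -rf /usr/libexec/osu-micro-benchmarks   # drop the ~3MB OMB binaries until a configure flag exists

# Affinity information is often not visible inside containers; disabling MVAPICH2's own binding
# is a common workaround for the affinity error shown above
singularity exec ../base.sif mpirun -n 2 -genv MV2_ENABLE_AFFINITY=0 -genv FI_IFACE=tcp \
    ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw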
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
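Since the reply above notes that AWS HPC instances expose EFA rather than InfiniBand adapters, it can help to confirm which fabric a given cloud instance actually presents before choosing between the generic IB build and MVAPICH2-X-AWS. A small sketch, assuming the libfabric and rdma-core utilities are installed in the image:

# libfabric view: an EFA-enabled AWS instance should list an "efa" provider
fi_info -p efa
# verbs view: IB-based clouds show mlx4_*/mlx5_* devices here; EFA-only instances typically show only efa_0 (or nothing)
ibv_devices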
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Wed, 26 Jan 2022 08:42:40 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de> Hi, I am aware of the fact, that there is a possibility to use the aws-version, however we do just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of tcp for aws would be enough (at the moment). However the question (compare to 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) Why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note nemesis:ib is depricated): error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' ..... remains. Am I missing any flags, prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Mvapich-discuss Im Auftrag von mvapich-discuss-request at lists.osu.edu Gesendet: Freitag, 21. 
Januar 2022 15:45 An: mvapich-discuss at lists.osu.edu Betreff: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
------------------------------ Subject: Digest Footer _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ------------------------------ End of Mvapich-discuss Digest, Vol 14, Issue 8 ********************************************** From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com> Good day OSU team, I've been debugging an issue with our system that came up after installing extra ConnectX-5 cards in some of the servers (12 out of 48) in our system. When adding a single rank from a server hosting an extra CX-5, we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can use all hosts and the maximum number of ranks. The CX-5 cards were added a month ago and I initially suspected I had made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version used I get different errors (which is strange). The problem started showing up when the cables were connected between the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card I cannot run the simple test below. Removing the "MV2_HOMOGENEOUS_CLUSTER=1" setting does not make any difference and explicitly specifying "MV2_IBA_HCA=mlx5_0" doesn't help either. Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here. 
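When an extra HCA is installed in only some of the hosts, the verbs device numbering can differ from node to node, so before pinning a device with MV2_IBA_HCA it may be worth surveying what every host actually reports and confirming that each entry in the host file is reachable. A sketch; the grep pattern and the assumption of one hostname per line in all_cards.cfg are illustrative:

# On each host: list RDMA devices and their port states
ibv_devinfo | grep -E "hca_id|state"
# Quick reachability check of every host handed to mpiexec (assumes one hostname per line)
while read h; do ssh -o ConnectTimeout=5 "$h" hostname || echo "unreachable: $h"; done < all_cards.cfg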
Unfortunately, I had to disable all the cards to keep tests running, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open-source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only a DT_NEEDED entry for libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari. 
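Given the compatibility constraints described in the reply above (exact CUDA series, matching MOFED major version), the quickest way to pick or verify a GDR RPM is usually to read the versions straight off the node and to reproduce the DT_NEEDED observation on the installed library. A sketch; MV2_GDR_PREFIX is a placeholder for wherever the RPM was unpacked:

# CUDA toolkit, driver and MOFED release on the node
nvcc --version      # toolkit, e.g. 11.4
nvidia-smi          # driver, e.g. 470.82.01
ofed_info -s        # e.g. MLNX_OFED_LINUX-5.4-...
# Which CUDA libraries the installed MPI library actually links against
readelf -d "$MV2_GDR_PREFIX/lib64/libmpi.so" | grep NEEDED
ldd "$MV2_GDR_PREFIX/lib64/libmpi.so" | grep -i cuda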
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
From shineman.5 at osu.edu  Wed Jan 12 09:26:06 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 14:26:06 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi John,

You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications.

Thanks,
Nat
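For concreteness, a minimal sketch of the alien-based conversion Nat describes, assuming the mvapich2-gdr RPM cited later in this thread and a stock Ubuntu 20.04 host; the exact name of the .deb that alien generates can differ:

    # Convert the MVAPICH2-GDR RPM to a .deb and install it (sketch, not taken verbatim from the thread)
    sudo apt-get install -y alien
    sudo alien --scripts --to-deb mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
    # alien appends its own Debian revision; adjust the file name to whatever it prints
    sudo dpkg -i mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1*.deb
    # The GDR RPMs typically unpack under /opt/mvapich2/gdr/<version>/; verify the layout after conversion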
From john at flexcompute.com  Wed Jan 12 11:14:15 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:14:15 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi Nat,

We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM:
http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm

If you could build us a custom RPM for our system, that would be very helpful.

We're running Ubuntu 20.04, kernel 5.4.0-92-generic.
GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CUDA version is CUDA 11.4
CUDA driver: 470.82.01

Please let me know if there is any other information that you need.

Thanks,
John

From shineman.5 at osu.edu  Wed Jan 12 11:16:45 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:16:45 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi John,

Can you tell us the ofed version on your system?

Thanks,
Nat
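A quick way to gather this information on a node with MLNX_OFED installed (generic commands, not something prescribed in the thread):

    # Print the installed Mellanox OFED version string
    ofed_info -s
    # List the InfiniBand HCAs, ports, and firmware/driver details
    ibv_devinfo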
From john at flexcompute.com  Wed Jan 12 11:19:08 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:19:08 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hi Nat,

We are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64

Thanks,
John
From shineman.5 at osu.edu  Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

John,

Thanks, we will get started on generating this RPM shortly.

Nat
From john at flexcompute.com  Wed Jan 12 11:21:05 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:21:05 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Great, thank you.
From john at flexcompute.com  Wed Jan 12 15:46:07 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 15:46:07 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To:
References:
Message-ID:

Hello,

While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node, without errors. However, when we try to run across two nodes, we get the following error:

john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region
[cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12)

We found some documentation that said this may be due to the value of log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_*. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above.

The output for ulimit -a on both nodes is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126989
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126989
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any advice on how to resolve this error would be greatly appreciated.

Thanks,
John
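The thread does not answer this point directly; a commonly suggested first check for "Cannot register vbuf region" combined with a 64 MB "max locked memory" limit is to raise the memlock limit on both nodes. A sketch under that assumption (the limits.conf entries and the unlimited value are illustrative, and the new limit must apply to the shells that actually launch the MPI ranks, e.g. non-interactive ssh sessions):

    # Memory registration for vbufs is capped by the locked-memory (memlock) limit.
    # ulimit -a above reports "max locked memory ... 65536" (64 MB), which is easily exhausted.
    # Example /etc/security/limits.conf entries on both nodes, followed by a fresh login:
    #   * soft memlock unlimited
    #   * hard memlock unlimited
    ulimit -l    # should now report "unlimited" in the environment that runs mpirun and the ranks
    # Note: log_num_mtt is an mlx4_core module parameter; mlx5 HCAs under MLNX_OFED 5.x do not use it.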
From Joachim.Tscheuschner at dwd.de  Fri Jan 21 09:27:51 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Fri, 21 Jan 2022 14:27:51 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To:
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de>
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>

Hello Nat,

we are building a portable container for different cloud providers. While everything works on our system, testing on the AWS cloud shows a problem.

If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work (AWS):

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The attached document contains the script to install mvapich2 in a Docker container with Debian bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, 1 December 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar; Subramoni, Hari
Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed.

Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run "make install", though, you should see all of the OMB binaries in a directory called "libexec" in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory "/usr/libexec/osu-micro-benchmarks"; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running "./configure --help" will list all of the configuration options available to you. You are correct that "--disable-fortran" is one way to reduce the installation size. Likewise you can use "--disable-cxx" to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all", as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set "--enable-g=none" (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.

Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done.

Thanks,
Nat

________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark

Hi MVAPICH-Team,

we want to use containers for a cloud project. To minimize the container size & ability, we would like to disable, for some, the benchmark tests and the compiling ability. Could you give us a recommendation to do so? We would like to use a Debian version (instead of CentOS) in the container. As far as I understood, with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for Fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud?

Sincerely
Joachim Tscheuschner

-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
Desc: install.mvapich.sh

From panda at cse.ohio-state.edu  Fri Jan 21 09:44:53 2022
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 21 Jan 2022 14:44:53 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de>
Message-ID:

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named 'MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks,
DK
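Pulling the options from Nat's December reply quoted above into one sketch; the installation prefix /opt/mvapich2 and the make parallelism are illustrative assumptions, not values from the thread:

    # Slimmed-down build using the flags discussed above; prefix and -j value are assumed
    ./configure --prefix=/opt/mvapich2 --enable-g=none --disable-cxx --disable-fortran
    make -j 4 && make install
    # Workaround for the missing "no OMB" configure flag: remove the installed benchmark binaries
    rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks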
From Joachim.Tscheuschner at dwd.de  Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi,

I am aware of the fact that there is a possibility to use the AWS version; however, we just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version. In this case the use of tcp for AWS would be enough (at the moment). However, the question (compare to 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why

./configure --with-device=ch3:nemesis:ib,tcp

does not compile (note nemesis:ib is deprecated) remains:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'
...

Am I missing any flags or prerequisites?

Cheers
Joachim
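As a side note for shipping one container to several clouds, MVAPICH2's mpiname utility reports how an installed build was configured (assuming the build's bin directory is on PATH); this only inspects a build and is not a fix for the rank_list compile error above:

    # Print the MVAPICH2 version, device/channel, and the exact configure options of the installed build
    mpiname -a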
-----Original Message-----
From: Mvapich-discuss On Behalf Of mvapich-discuss-request at lists.osu.edu
Sent: Friday, 21 January 2022 15:45
To: mvapich-discuss at lists.osu.edu
Subject: Mvapich-discuss Digest, Vol 14, Issue 8

Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu

To subscribe or unsubscribe via the World Wide Web, visit https://lists.osu.edu/mailman/listinfo/mvapich-discuss or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu

You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu

When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..."

Today's Topics:

   1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim)
   2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar)

----------------------------------------------------------------------

Message: 1
Date: Fri, 21 Jan 2022 14:27:51 +0000
From: Tscheuschner Joachim
To: "'Shineman, Nat'", Tscheuschner Joachim
Cc: "Panda, Dhabaleswar", "mvapich-discuss at lists.osu.edu"
Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de>
Content-Type: text/plain; charset="utf-8"

Hello Nat,

we are building a portable container for different cloud providers. While everything works on our system, testing in the AWS cloud shows a problem.

Cheers
Joachim

-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
Desc: install.mvapich.sh

------------------------------

Message: 2
Date: Fri, 21 Jan 2022 14:44:53 +0000
From: "Panda, Dhabaleswar"
To: Tscheuschner Joachim, "Shineman, Nat"
Cc: "mvapich-discuss at lists.osu.edu"
Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark
Message-ID:
Content-Type: text/plain; charset="utf-8"

Hi Joachim,

Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version.

Thanks,
DK

------------------------------

End of Mvapich-discuss Digest, Vol 14, Issue 8
**********************************************

From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team,

I've been debugging an issue with our system that just came up after installing new extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5, we start seeing the problem shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the "all_cards.cfg" host file, I can use all hosts and the maximum number of ranks.

The CX-5 cards were added a month ago and I initially suspected I made a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used, I get a different error (which is strange). The problem started exhibiting when the cables were added to the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing "MV2_HOMOGENEOUS_CLUSTER=1" does not make any difference, and explicitly specifying "MV2_IBA_HCA=mlx5_0" doesn't help either.

Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. Likely a configuration issue at my end, but I don't know what I'm missing here.
Unfortunately, I had to disable all the cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the 3 drops we received for Rockport.

  /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100

  ssh: connect to host 172.20.141.148 port 22: No route to host
  ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested
  [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort
  [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)
  [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
  [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
  [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
  [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
  [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
  [user at dell-s13-h1

Regards,

Nicolas Gagnon
Principal Designer/Architect, Engineering
ngagnon at rockportnetworks.com
Rockport | Simplify the Network

From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022
From: daniel.pou at hpe.com (Pou, Dan)
Date: Fri, 28 Jan 2022 18:20:26 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>

I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open-source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions.

Are there any known issues using an MVAPICH2-GDR built for CUDA 11.2 with CUDA 11.4?
Are there any known issues using an MVAPICH2-GDR built for OFED 5.x with OFED 5.4?

Thanks,
-Dan

From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022
From: subramoni.1 at osu.edu (Subramoni, Hari)
Date: Fri, 28 Jan 2022 18:42:17 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
Message-ID:

Hi, Dan.

We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases.

Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher.

RPMs for MOFED 5.x should work with MOFED 5.4.

Best,
Hari.
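To reproduce the DT_NEEDED observation Dan mentions, a minimal sketch; the install path (/opt/mvapich2-gdr) and the library name libmpi.so are assumptions, not details taken from this thread:

  # Shared-library dependencies recorded in the MPI library
  readelf -d /opt/mvapich2-gdr/lib64/libmpi.so | grep NEEDED
  # Which concrete libraries the dynamic linker would resolve them to
  ldd /opt/mvapich2-gdr/lib64/libmpi.so | grep -i cuda

Entries such as libcudart.so.11.0 and libcuda.so.1 in the NEEDED list correspond to what Dan reports.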
From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022
From: subramoni.1 at osu.edu (Subramoni, Hari)
Date: Fri, 28 Jan 2022 19:20:02 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN>
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN>
Message-ID:

Hi, Dan.

Please let us know which specific RPMs you would be interested in; we can build them for you. We are working on building the ones you requested as we speak.

Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page.

Thx,
Hari.

From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022
From: daniel.pou at hpe.com (Pou, Dan)
Date: Fri, 28 Jan 2022 19:10:55 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To:
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN>
Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN>

Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule.

It would be greatly appreciated if GDR could keep up with the latest CUDA minor releases for RHEL8 x86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support.

Thanks,
-Dan

From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022
From: daniel.pou at hpe.com (Pou, Dan)
Date: Fri, 28 Jan 2022 19:38:03 +0000
Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs
In-Reply-To:
References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN>
Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN>

We are just using the RHEL8 GCC GDR binaries. I think we prefer GCC 8.1. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+.

Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend.

Cheers,
-Dan
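As a sanity check against the CUDA constraint Hari describes (GDR RPMs built for CUDA 11.2 or lower are not compatible with CUDA 11.3 and higher), one can compare the CUDA version on the node with the version encoded in the RPM name; the RPM filename below is the one cited elsewhere in this archive and is only an example:

  # CUDA toolkit and driver installed on the node
  nvcc --version
  nvidia-smi
  # The CUDA version an MVAPICH2-GDR RPM targets is part of its name and metadata
  rpm -qip mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm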
From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 14:26:06 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

Hi John,

You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the MOFED major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications.

Thanks,
Nat
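For readers following the alien route Nat mentions, a minimal sketch; the RPM filename is the one cited later in this thread, and the exact name of the generated .deb package will differ slightly:

  # Convert the MVAPICH2-GDR RPM to a .deb, keeping the maintainer scripts
  sudo alien --to-deb --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
  # Install the resulting package (exact filename will vary)
  sudo dpkg -i mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1_*.deb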
From john at flexcompute.com Wed Jan 12 11:14:15 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:14:15 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

Hi Nat,

We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM:
http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm

If you could build us a custom RPM for our system, that would be very helpful.

We're running Ubuntu 20.04, kernel 5.4.0-92-generic.
GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CUDA version is CUDA 11.4
CUDA driver: 470.82.01

Please let me know if there is any other information that you need.

Thanks,
John

From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:16:45 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

Hi John,

Can you tell us the ofed version on your system?

Thanks,
Nat

From john at flexcompute.com Wed Jan 12 11:19:08 2022
From: john at flexcompute.com (John Moore)
Date: Wed, 12 Jan 2022 11:19:08 -0500
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

Hi Nat,

we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64

Thanks,
John

From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022
From: shineman.5 at osu.edu (Shineman, Nat)
Date: Wed, 12 Jan 2022 16:20:42 +0000
Subject: [Mvapich-discuss] MVAPICH2 GDR from source code?
In-Reply-To: References: Message-ID:

John,

Thanks, we will get started on generating this RPM shortly.

Nat

________________________________
From: John Moore
Sent: Wednesday, January 12, 2022 11:19
To: Shineman, Nat
Cc: Panda, Dhabaleswar; Maitham Alhubail; mvapich-discuss at lists.osu.edu
Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code?

Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John

On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat

________________________________
From: John Moore
Sent: Wednesday, January 12, 2022 11:14
To: Shineman, Nat
Cc: Panda, Dhabaleswar; Maitham Alhubail; mvapich-discuss at lists.osu.edu
Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code?

Hi Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week.
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
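Coming back to the vbuf registration failure reported earlier in this message: one frequent cause of "Cannot allocate memory" during verbs memory registration (not a confirmed diagnosis for this system) is a low locked-memory limit, and the ulimit output above shows max locked memory capped at 65536 kB. A sketch of how that limit is commonly raised, assuming root access on both nodes:

ulimit -l                                   # currently 65536 kB per the output above
sudo tee -a /etc/security/limits.conf <<'EOF'
* soft memlock unlimited
* hard memlock unlimited
EOF
# Start a fresh login session (or restart the daemon that spawns the MPI processes,
# e.g. sshd) so the new limit is actually inherited, then re-check:
ulimit -l                                   # should now report "unlimited"
# Note: the log_num_mtt tuning mentioned above applies to the older mlx4 driver;
# ConnectX-5/6 HCAs use mlx5, which has no such module parameter, so the absence of
# an mlx4_en.conf file under /etc/modprobe.d is expected on this system.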
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work on AWS:

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, BUT as a portable container I will lose IB support on other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile:

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'

The attachment contains the script we use to install mvapich2 in a Docker container with Debian bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Ursprüngliche Nachricht-----
Von: Shineman, Nat
Gesendet: Mittwoch, 1. Dezember 2021 19:28
An: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari
Betreff: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a Debian-based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed.

Unfortunately, at this time we do not have a configure-time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install, though, you should see all of the OMB binaries in a directory called libexec in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality.

Regarding the other disable options: running ./configure --help will list all of the configuration options available to you. You are correct that --disable-fortran is one way to reduce the installation size. Likewise you can use --disable-cxx to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
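Pulling the size-reduction suggestions above together, a minimal container build could look roughly like this (a sketch only; the prefix is an arbitrary example and the flag set is limited to what this thread mentions):

./configure --prefix=/opt/mvapich2 \
            --enable-g=none \
            --disable-fortran \
            --disable-cxx
make -j"$(nproc)"
make install
# The OSU micro-benchmarks are installed unconditionally (no configure flag yet),
# so remove them afterwards as suggested above to save roughly 3 MB
rm -rf /opt/mvapich2/libexec/osu-micro-benchmarks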
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. 
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large, and we can see what can be done.

Thanks,
Nat
________________________________
From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss
Sent: Tuesday, November 30, 2021 08:34
To: 'mvapich-discuss at lists.osu.edu'
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark

Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner

_______________________________________________
Mvapich-discuss mailing list
Mvapich-discuss at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich-discuss

-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.mvapich.sh
Type: application/octet-stream
Size: 2324 bytes
Desc: install.mvapich.sh
URL:

From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi, I am aware that there is the possibility of using the AWS version; however, we only provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and in that case using TCP on AWS would be enough (at the moment). However, the question (compare section 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why

./configure --with-device=ch3:nemesis:ib,tcp

does not compile (note that nemesis:ib is deprecated):

error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' .....

remains. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Ursprüngliche Nachricht-----
Von: Mvapich-discuss Im Auftrag von mvapich-discuss-request at lists.osu.edu
Gesendet: Freitag, 21.
Januar 2022 15:45 An: mvapich-discuss at lists.osu.edu Betreff: Mvapich-discuss Digest, Vol 14, Issue 8 Send Mvapich-discuss mailing list submissions to mvapich-discuss at lists.osu.edu To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ or, via email, send a message with subject or body 'help' to mvapich-discuss-request at lists.osu.edu You can reach the person managing the list at mvapich-discuss-owner at lists.osu.edu When replying, please edit your Subject line so it is more specific than "Re: Contents of Mvapich-discuss digest..." Today's Topics: 1. Re: Compiling / Instalation without osu-benchmark (Tscheuschner Joachim) 2. Re: Compiling / Instalation without osu-benchmark (Panda, Dhabaleswar) ---------------------------------------------------------------------- Message: 1 Date: Fri, 21 Jan 2022 14:27:51 +0000 From: Tscheuschner Joachim To: "'Shineman, Nat'" , Tscheuschner Joachim Cc: "Panda, Dhabaleswar" , "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: <7bc0af678f3d41dc84b7d20d206320c9 at exch.dwd.de> Content-Type: text/plain; charset="utf-8" Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. 
First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. 
Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: ------------------------------ Message: 2 Date: Fri, 21 Jan 2022 14:44:53 +0000 From: "Panda, Dhabaleswar" To: Tscheuschner Joachim , "Shineman, Nat" Cc: "mvapich-discuss at lists.osu.edu" Subject: Re: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Message-ID: Content-Type: text/plain; charset="utf-8" Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. If we compile mvapich2-2.3.6 with ./configure --enable-g=none the communication does not work (aws) [rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does notfully support the HCA found on the system (try with other build options) [cli_25]: aborting job: Fatal error in PMPI_Init_thread: Other MPI error, error stack: MPIR_Init_thread(493)............: MPID_Init(419)...................: channel initialization failed MPIDI_CH3_Init(471)..............: rdma_get_control_parameters rdma_get_control_parameters(1926): rdma_open_hca rdma_open_hca(1046)..............: No IB device found While with ./configure --enable-g=all and singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw CPU Affinity is undefined [cli_1]: aborting job: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........: MPID_Init(400)...............: MPIDI_CH3I_set_affinity(3474): smpi_setaffinity(2719).......: CPU Affinity is undefined. And ./configure --with-device=ch3:nemesis:tcp MPI works again on AWS, BUT as a portable container I will lose ib for other cloud-systems. And, as mentioned in the userguide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' The document contains the script to install mvapich2 in a docker-container with Debian:bullseye 10 or 11. Am I missing any flags or prerequests? Cheers Joachim -----Urspr?ngliche Nachricht----- Von: Shineman, Nat Gesendet: Mittwoch, 1. Dezember 2021 19:28 An: Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Betreff: Re: Compiling / Instalation without osu-benchmark Joachim, Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. 
Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run make install? though you should see all of the OMB binaries in a directory called libexec? in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory /usr/libexec/osu-micro-benchmarks?, in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running ./configure --help? will list all of the configuration options available to you. You are correct that --disable-fortran? is one way to reduce the installation size. Likewise you can use --disable-cxx? to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all" as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set --enable-g=none? (which should be default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible. Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. 
Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken.

------------------------------

Subject: Digest Footer

_______________________________________________
Mvapich-discuss mailing list
Mvapich-discuss at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich-discuss

------------------------------

End of Mvapich-discuss Digest, Vol 14, Issue 8
**********************************************

From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022
From: ngagnon at rockportnetworks.com (Nicolas Gagnon)
Date: Wed, 26 Jan 2022 15:40:18 +0000
Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5
In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com>
Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com>

Good day OSU team, I've been debugging an issue with our system that came up after installing extra ConnectX-5 cards in some servers (12 out of 48) of our system. When adding a single rank from a server hosting an extra CX-5, we start seeing problems, as shown below. In this case, I've disabled the PCIe slot hosting the extra CX-5 on all servers except one and am still getting the error. If I remove the offending server from the 'all_cards.cfg' host file, I can use all hosts and the maximum number of ranks.

The CX-5 cards were added a month ago, and I initially suspected a mistake in the way I built the latest code, but I've tried multiple versions released to Rockport and still get into this state. Depending on the version being used, I get different errors (which is strange). The problem started showing up when the cables were connected between the CX-5 cards and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below. Removing 'MV2_HOMOGENEOUS_CLUSTER=1' does not make any difference, and explicitly specifying 'MV2_IBA_HCA=mlx5_0' doesn't help either.

Note: the interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn't find any info in the User's Guide related to the problem I'm seeing. It is likely a configuration issue at my end, but I don't know what I'm missing here.
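Since the launch attempt quoted below stops at an ssh "No route to host" error before MPI itself starts, one quick sanity check (a sketch, assuming password-less ssh and the all_cards.cfg host file named above) is to confirm that every entry in the host file is reachable:

while read -r entry _; do
    [ -z "$entry" ] && continue           # skip blank lines
    host=${entry%%:*}                     # drop any :slots suffix used in hydra host files
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
        || echo "unreachable: $host"
done < all_cards.cfg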
Unfortunately, I had to disable all the cards to run tests, but if required I can reconfigure some of the servers to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6 and the three drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity of MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia has not incremented any symbol versions. Are there any known issues using an MVAPICH2-GDR CUDA 11.2 build with CUDA 11.4? Are there any known issues using an MVAPICH2-GDR OFED 5.x build with OFED 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features, RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari.
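(A quick way to check which MOFED and CUDA versions a system is actually running, and which CUDA runtime soname a given GDR build links against, before picking an RPM per the compatibility note above. The library path is only an example install location, not the official one.)

    ofed_info -s                       # installed MLNX OFED version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    nvcc --version                     # installed CUDA toolkit
    # DT_NEEDED entries of the GDR MPI library (example install path)
    readelf -d /opt/mvapich2/gdr/2.3.6/lib64/libmpi.so | grep NEEDED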
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss From john at flexcompute.com Tue Jan 11 14:48:30 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:48:30 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Message-ID: Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Tue Jan 11 14:55:28 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Tue, 11 Jan 2022 19:55:28 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John From john at flexcompute.com Tue Jan 11 14:58:43 2022 From: john at flexcompute.com (John Moore) Date: Tue, 11 Jan 2022 14:58:43 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar wrote: > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. 
We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. > > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 09:26:06 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 14:26:06 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of John Moore via Mvapich-discuss Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar Cc: Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? 
Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john at flexcompute.com Wed Jan 12 11:14:15 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:14:15 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:16:45 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:16:45 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. 
Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:19:08 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:19:08 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. 
> > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shineman.5 at osu.edu Wed Jan 12 11:20:42 2022 From: shineman.5 at osu.edu (Shineman, Nat) Date: Wed, 12 Jan 2022 16:20:42 +0000 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: John, Thanks, we will get started on generating this RPM shortly. Nat ________________________________ From: John Moore Sent: Wednesday, January 12, 2022 11:19 To: Shineman, Nat Cc: Panda, Dhabaleswar ; Maitham Alhubail ; mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi Nat, we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 Thanks, John On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat > wrote: Hi John, Can you tell us the ofed version on your system? Thanks, Nat ________________________________ From: John Moore > Sent: Wednesday, January 12, 2022 11:14 To: Shineman, Nat > Cc: Panda, Dhabaleswar >; Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? HI Nat, We have been struggling to get the RPM to work for us -- we've been working on it for about a week. 
We are using this RPM: http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm If you could build us a custom RPM for our system, that would be very helpful. We're running Ubuntu 20.04 kernel 5.4.0-92-generic GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 CUDA version is CUDA 11.4 CUDA driver: 470.82.01 Please let me know if there is any other information that you need. Thanks, John On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat > wrote: Hi John, You should be able to use the RPMs on Ubuntu by converting them with alien. Regarding the CUDA and compiler versioning, you will want to make sure CUDA is an exact match, but the compiler should only need to be the same major version. You will also want to make sure that you match the mofed major version as well, though we recommend matching the exact version if possible. Please take a look at the download page and see if any of the RPMs there match your needs. Otherwise, we would be happy to generate a custom RPM based on your system specifications. Thanks, Nat ________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 14:58 To: Panda, Dhabaleswar > Cc: Maitham Alhubail >; mvapich-discuss at lists.osu.edu > Subject: Re: [Mvapich-discuss] MVAPICH2 GDR from source code? Hi DK, Do the CUDA and GCC versions on our system need to match the RPM exactly? We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. Thank you, John On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar > wrote: Hi, Thanks for your note. For GPU support with MVAPICH2, it is strongly recommended to use the MVAPICH2-GDR package. This package supports many features related to GPUs and delivers the best performance and scalability on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR download page for your system. Please refer to the corresponding user guide also. The MVAPICH2-GDR package can also be installed through Spack. Let us know if you experience any issues in using the MVAPICH2-GDR package on your GPU cluster. Thanks, DK ________________________________________ From: Mvapich-discuss > on behalf of John Moore via Mvapich-discuss > Sent: Tuesday, January 11, 2022 2:48 PM To: mvapich-discuss at lists.osu.edu Cc: Maitham Alhubail Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? Hello, We have been struggling to get MVAPICH2 to work with cuda-aware support and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda option, but when we run the osu_bibw bandwidth test using Device to Device communication, we get a segmentation fault. 
Below is the output from osu_bibw using MVAPICH2: MVAPICH2-2.3.6 Parameters --------------------------------------------------------------------- PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD PROCESSOR MODEL NUMBER : 1 HCA NAME : MV2_HCA_MLX_CX_HDR HETEROGENEOUS HCA : NO MV2_EAGERSIZE_1SC : 0 MV2_SMP_EAGERSIZE : 16385 MV2_SMP_QUEUE_LENGTH : 65536 MV2_SMP_NUM_SEND_BUFFER : 16 MV2_SMP_BATCH_SIZE : 8 Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 MV2_HCA_MLX_CX_HDR --------------------------------------------------------------------- # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.07 2 0.15 4 0.29 8 0.57 16 1.12 32 2.30 64 4.75 128 9.41 256 18.44 512 37.22 1024 74.82 2048 144.70 4096 289.96 8192 577.33 [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11) [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 471850 RUNNING AT cell3 = EXIT CODE: 139 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== And this is with OpenMPI: # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) # Size Bandwidth (MB/s) 1 0.43 2 0.83 4 1.68 8 3.37 16 6.72 32 13.42 64 27.02 128 53.78 256 107.88 512 219.45 1024 437.81 2048 875.12 4096 1747.23 8192 3528.97 16384 7015.15 32768 13973.59 65536 27702.68 131072 51877.67 262144 94556.99 524288 157755.18 1048576 236772.67 2097152 333635.13 4194304 408865.93 Can GDR support be obtained by compiling from source like we are trying to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any recommendations would be greatly appreciated. Thanks, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 11:21:05 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 11:21:05 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Great, thank you. On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > John, > > Thanks, we will get started on generating this RPM shortly. > > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:19 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi Nat, > > we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 > > Thanks, > John > > On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat wrote: > > Hi John, > > Can you tell us the ofed version on your system? > > Thanks, > Nat > ------------------------------ > *From:* John Moore > *Sent:* Wednesday, January 12, 2022 11:14 > *To:* Shineman, Nat > *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < > maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < > mvapich-discuss at lists.osu.edu> > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > HI Nat, > > We have been struggling to get the RPM to work for us -- we've been > working on it for about a week. 
We are using this RPM: > > http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm > > If you could build us a custom RPM for our system, that would be very > helpful. > > We're running Ubuntu 20.04 kernel 5.4.0-92-generic > > GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 > > CUDA version is CUDA 11.4 > CUDA driver: 470.82.01 > > Please let me know if there is any other information that you need. > > Thanks, > John > > > On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: > > Hi John, > > You should be able to use the RPMs on Ubuntu by converting them with > alien. Regarding the CUDA and compiler versioning, you will want to make > sure CUDA is an exact match, but the compiler should only need to be the > same major version. You will also want to make sure that you match the > mofed major version as well, though we recommend matching the exact version > if possible. Please take a look at the download page and see if any of the > RPMs there match your needs. Otherwise, we would be happy to generate a > custom RPM based on your system specifications. > > Thanks, > Nat > ------------------------------ > *From:* Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > *Sent:* Tuesday, January 11, 2022 14:58 > *To:* Panda, Dhabaleswar > *Cc:* Maitham Alhubail ; > mvapich-discuss at lists.osu.edu > *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hi DK, > > Do the CUDA and GCC versions on our system need to match the RPM exactly? > We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. > > Thank you, > John > > On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < > panda at cse.ohio-state.edu> wrote: > > Hi, > > Thanks for your note. For GPU support with MVAPICH2, it is strongly > recommended to use the MVAPICH2-GDR package. This package supports many > features related to GPUs and delivers the best performance and scalability > on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR > download page for your system. Please refer to the corresponding user guide > also. The MVAPICH2-GDR package can also be installed through Spack. Let us > know if you experience any issues in using the MVAPICH2-GDR package on your > GPU cluster. > > Thanks, > > DK > > > ________________________________________ > From: Mvapich-discuss osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < > mvapich-discuss at lists.osu.edu> > Sent: Tuesday, January 11, 2022 2:48 PM > To: mvapich-discuss at lists.osu.edu > Cc: Maitham Alhubail > Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? > > Hello, > > We have been struggling to get MVAPICH2 to work with cuda-aware support > and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda > option, but when we run the osu_bibw bandwidth test using Device to Device > communication, we get a segmentation fault. 
> > Below is the output from osu_bibw using MVAPICH2: > MVAPICH2-2.3.6 Parameters > --------------------------------------------------------------------- > PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 > PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD > PROCESSOR MODEL NUMBER : 1 > HCA NAME : MV2_HCA_MLX_CX_HDR > HETEROGENEOUS HCA : NO > MV2_EAGERSIZE_1SC : 0 > MV2_SMP_EAGERSIZE : 16385 > MV2_SMP_QUEUE_LENGTH : 65536 > MV2_SMP_NUM_SEND_BUFFER : 16 > MV2_SMP_BATCH_SIZE : 8 > Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 > MV2_HCA_MLX_CX_HDR > --------------------------------------------------------------------- > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.07 > 2 0.15 > 4 0.29 > 8 0.57 > 16 1.12 > 32 2.30 > 64 4.75 > 128 9.41 > 256 18.44 > 512 37.22 > 1024 74.82 > 2048 144.70 > 4096 289.96 > 8192 577.33 > [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault > (signal 11) > [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault > (signal 11) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 471850 RUNNING AT cell3 > = EXIT CODE: 139 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > And this is with OpenMPI: > # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 > # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) > # Size Bandwidth (MB/s) > 1 0.43 > 2 0.83 > 4 1.68 > 8 3.37 > 16 6.72 > 32 13.42 > 64 27.02 > 128 53.78 > 256 107.88 > 512 219.45 > 1024 437.81 > 2048 875.12 > 4096 1747.23 > 8192 3528.97 > 16384 7015.15 > 32768 13973.59 > 65536 27702.68 > 131072 51877.67 > 262144 94556.99 > 524288 157755.18 > 1048576 236772.67 > 2097152 333635.13 > 4194304 408865.93 > > > Can GDR support be obtained by compiling from source like we are trying to > do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any > recommendations would be greatly appreciated. > > Thanks, > John > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john at flexcompute.com Wed Jan 12 15:46:07 2022 From: john at flexcompute.com (John Moore) Date: Wed, 12 Jan 2022 15:46:07 -0500 Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? In-Reply-To: References: Message-ID: Hello, While we wait for the RPM, we are trying to get regular MVAPICH2 to work across two of our nodes. We compiled version 2.3.6 from source. We can run the osu_bibw test locally, within a node without errors. However, when we try to run across two nodes, we get the following error: john at cell3:/shared_data/john_dev/osu-micro-benchmarks-5.8/mpi/pt2pt$ MV2_SMP_USE_CMA=0 mpirun -np 2 -hostfile hostfile ./osu_bibw [cell3:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/. 
[cell3:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1 [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell3:mpi_rank_0][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 487] Cannot register vbuf region [cell4:mpi_rank_1][get_vbuf_by_offset] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:632: vbuf pool allocation failed: Cannot allocate memory (12) We found some documentation that said this may be due to the value of the log_num_mtt for OFED. We've found documentation for how to change this, and it involves changing the parameter in /etc/modprobe.d/mlx4_en.conf. However, we do not have any mlx4_* under /etc/modprobe.d, only mlx5_. We are using MLNX_OFED_LINUX-5.5-1.0.3.2 as mentioned above. The output for ulimit -a on both nodes is: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 4126989 max locked memory (kbytes, -l) 65536 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 4126989 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Any advice on how to resolve this error would be greatly appreciated. Thanks, John On Wed, Jan 12, 2022 at 11:21 AM John Moore wrote: > Great, thank you. > > On Wed, Jan 12, 2022 at 11:20 AM Shineman, Nat wrote: > >> John, >> >> Thanks, we will get started on generating this RPM shortly. >> >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:19 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi Nat, >> >> we are using: MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64 >> >> Thanks, >> John >> >> On Wed, Jan 12, 2022 at 11:16 AM Shineman, Nat >> wrote: >> >> Hi John, >> >> Can you tell us the ofed version on your system? >> >> Thanks, >> Nat >> ------------------------------ >> *From:* John Moore >> *Sent:* Wednesday, January 12, 2022 11:14 >> *To:* Shineman, Nat >> *Cc:* Panda, Dhabaleswar ; Maitham Alhubail < >> maitham at flexcompute.com>; mvapich-discuss at lists.osu.edu < >> mvapich-discuss at lists.osu.edu> >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> HI Nat, >> >> We have been struggling to get the RPM to work for us -- we've been >> working on it for about a week. We are using this RPM: >> >> http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.6/mofed5.4/mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm >> >> If you could build us a custom RPM for our system, that would be very >> helpful. >> >> We're running Ubuntu 20.04 kernel 5.4.0-92-generic >> >> GCC version is: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 >> >> CUDA version is CUDA 11.4 >> CUDA driver: 470.82.01 >> >> Please let me know if there is any other information that you need. >> >> Thanks, >> John >> >> >> On Wed, Jan 12, 2022 at 9:26 AM Shineman, Nat wrote: >> >> Hi John, >> >> You should be able to use the RPMs on Ubuntu by converting them with >> alien. 
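As a concrete illustration of the alien-based conversion described just above, a minimal sketch (the RPM filename is the one quoted earlier in this thread; the name of the .deb that alien generates may differ slightly, and the --scripts option is used so the package's install scripts are carried over):

  sudo apt-get install alien
  sudo alien --scripts mvapich2-gdr-cuda11.3.mofed5.4.gnu8.4.1-2.3.6-1.el8.x86_64.rpm
  sudo dpkg -i mvapich2-gdr-*.deb

Separately, the "Cannot register vbuf region ... Cannot allocate memory (12)" failure reported earlier in this message is often a symptom of a low locked-memory limit; the ulimit output above shows max locked memory capped at 65536 KB. A hedged sketch of lifting it, assuming root access and that the new limits take effect in a fresh login/ssh session on both nodes:

  # /etc/security/limits.conf (append on both nodes):
  #   * soft memlock unlimited
  #   * hard memlock unlimited
  # then, in a fresh session, verify the limit that mpirun will actually see:
  ulimit -l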
Regarding the CUDA and compiler versioning, you will want to make >> sure CUDA is an exact match, but the compiler should only need to be the >> same major version. You will also want to make sure that you match the >> mofed major version as well, though we recommend matching the exact version >> if possible. Please take a look at the download page and see if any of the >> RPMs there match your needs. Otherwise, we would be happy to generate a >> custom RPM based on your system specifications. >> >> Thanks, >> Nat >> ------------------------------ >> *From:* Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> *Sent:* Tuesday, January 11, 2022 14:58 >> *To:* Panda, Dhabaleswar >> *Cc:* Maitham Alhubail ; >> mvapich-discuss at lists.osu.edu >> *Subject:* Re: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hi DK, >> >> Do the CUDA and GCC versions on our system need to match the RPM exactly? >> We are running on Ubuntu, and there is no GCC 8.4.1 on Ubuntu. >> >> Thank you, >> John >> >> On Tue, Jan 11, 2022 at 2:55 PM Panda, Dhabaleswar < >> panda at cse.ohio-state.edu> wrote: >> >> Hi, >> >> Thanks for your note. For GPU support with MVAPICH2, it is strongly >> recommended to use the MVAPICH2-GDR package. This package supports many >> features related to GPUs and delivers the best performance and scalability >> on GPU clusters. Please use a suitable RPM package from the MVAPICH2-GDR >> download page for your system. Please refer to the corresponding user guide >> also. The MVAPICH2-GDR package can also be installed through Spack. Let us >> know if you experience any issues in using the MVAPICH2-GDR package on your >> GPU cluster. >> >> Thanks, >> >> DK >> >> >> ________________________________________ >> From: Mvapich-discuss > osu.edu at lists.osu.edu> on behalf of John Moore via Mvapich-discuss < >> mvapich-discuss at lists.osu.edu> >> Sent: Tuesday, January 11, 2022 2:48 PM >> To: mvapich-discuss at lists.osu.edu >> Cc: Maitham Alhubail >> Subject: [Mvapich-discuss] MVAPICH2 GDR from source code? >> >> Hello, >> >> We have been struggling to get MVAPICH2 to work with cuda-aware support >> and RDMA. We have compiled MVAPICH2 from source, with the --enable-cuda >> option, but when we run the osu_bibw bandwidth test using Device to Device >> communication, we get a segmentation fault. 
>> >> Below is the output from osu_bibw using MVAPICH2: >> MVAPICH2-2.3.6 Parameters >> --------------------------------------------------------------------- >> PROCESSOR ARCH NAME : MV2_ARCH_AMD_EPYC_7401_48 >> PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_AMD >> PROCESSOR MODEL NUMBER : 1 >> HCA NAME : MV2_HCA_MLX_CX_HDR >> HETEROGENEOUS HCA : NO >> MV2_EAGERSIZE_1SC : 0 >> MV2_SMP_EAGERSIZE : 16385 >> MV2_SMP_QUEUE_LENGTH : 65536 >> MV2_SMP_NUM_SEND_BUFFER : 16 >> MV2_SMP_BATCH_SIZE : 8 >> Tuning Table: : MV2_ARCH_AMD_EPYC_7401_48 >> MV2_HCA_MLX_CX_HDR >> --------------------------------------------------------------------- >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.7.1 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.07 >> 2 0.15 >> 4 0.29 >> 8 0.57 >> 16 1.12 >> 32 2.30 >> 64 4.75 >> 128 9.41 >> 256 18.44 >> 512 37.22 >> 1024 74.82 >> 2048 144.70 >> 4096 289.96 >> 8192 577.33 >> [cell3:mpi_rank_0][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> [cell3:mpi_rank_1][error_sighandler] Caught error: Segmentation fault >> (signal 11) >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 471850 RUNNING AT cell3 >> = EXIT CODE: 139 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> And this is with OpenMPI: >> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.8 >> # Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D) >> # Size Bandwidth (MB/s) >> 1 0.43 >> 2 0.83 >> 4 1.68 >> 8 3.37 >> 16 6.72 >> 32 13.42 >> 64 27.02 >> 128 53.78 >> 256 107.88 >> 512 219.45 >> 1024 437.81 >> 2048 875.12 >> 4096 1747.23 >> 8192 3528.97 >> 16384 7015.15 >> 32768 13973.59 >> 65536 27702.68 >> 131072 51877.67 >> 262144 94556.99 >> 524288 157755.18 >> 1048576 236772.67 >> 2097152 333635.13 >> 4194304 408865.93 >> >> >> Can GDR support be obtained by compiling from source like we are trying >> to do or do we have to use an RPM? We export MV2_USE_CUDA=1. Any >> recommendations would be greatly appreciated. >> >> Thanks, >> John >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From Joachim.Tscheuschner at dwd.de Fri Jan 21 09:27:51 2022 From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim) Date: Fri, 21 Jan 2022 14:27:51 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> Message-ID: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
If we compile mvapich2-2.3.6 with ./configure --enable-g=none, the communication does not work (AWS):

[rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[cli_25]: aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(493)............:
MPID_Init(419)...................: channel initialization failed
MPIDI_CH3_Init(471)..............: rdma_get_control_parameters
rdma_get_control_parameters(1926): rdma_open_hca
rdma_open_hca(1046)..............: No IB device found

While with ./configure --enable-g=all and

singularity exec ../base.sif mpirun -n 2 -v -genv FI_IFACE=tcp ~/mvapich/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

we get:

CPU Affinity is undefined
[cli_1]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(493)........:
MPID_Init(400)...............:
MPIDI_CH3I_set_affinity(3474):
smpi_setaffinity(2719).......: CPU Affinity is undefined.

And with ./configure --with-device=ch3:nemesis:tcp, MPI works again on AWS, but as a portable container I will lose IB for other cloud systems. And, as mentioned in the user guide, ./configure --with-device=ch3:nemesis:ib,tcp does not compile: error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list'. The attached document contains the script to install mvapich2 in a Docker container with Debian bullseye 10 or 11. Am I missing any flags or prerequisites?

Cheers
Joachim

-----Original Message-----
From: Shineman, Nat
Sent: Wednesday, 1 December 2021 19:28
To: Tscheuschner Joachim
Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari
Subject: Re: Compiling / Instalation without osu-benchmark

Joachim,

Thanks for reaching out to us. First off, we would recommend mvapich2-2.3.6 for your use case. mv2-virt is very old and has not been updated for some time. Using a debian based container should be fine. Everything will compile normally from within the container, just ensure that you have GNU Autotools installed. Unfortunately, at this time we do not have a configure time flag for disabling the OMB suite. We will look into adding support for this in our upcoming 2.3.7 release. For the time being we can suggest a workaround. When you run 'make install' you should see all of the OMB binaries in a directory called 'libexec' in your mvapich2 installation directory. If you are installing it in the default location (/usr) you will see a directory '/usr/libexec/osu-micro-benchmarks'; in my experience it is typically around 3MB. You can delete this directory to remove the installed binaries from your system. This will have no impact on the rest of the library's functionality. Regarding the other disable options: running './configure --help' will list all of the configuration options available to you. You are correct that '--disable-fortran' is one way to reduce the installation size. Likewise you can use '--disable-cxx' to disable C++ bindings if you only wish to have C libraries installed. However, there is no option for "disable all", as many of the enable/disable flags are used to determine feature sets within the library and not actual binaries. Please ensure that you set '--enable-g=none' (which should be the default) to remove all debugging symbols and reduce size. Other than that, just avoiding enabling additional features should yield the smallest libraries possible.
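Pulling together the options named in this reply, a minimal sketch of a slimmed-down build (the prefix and job count are placeholders; the final line is the workaround described above for removing the installed OSU benchmarks):

  ./configure --prefix=/usr --enable-g=none --disable-fortran --disable-cxx
  make -j 8 && make install
  # optional: drop the installed OSU micro-benchmarks (~3MB)
  rm -rf /usr/libexec/osu-micro-benchmarks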
Please let us know if you have any trouble with the installation or if the compiled libraries are still too large and we can see what can be done. Thanks, Nat ________________________________ From: Mvapich-discuss on behalf of Tscheuschner Joachim via Mvapich-discuss Sent: Tuesday, November 30, 2021 08:34 To: 'mvapich-discuss at lists.osu.edu' Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark Hi MVAPICH-Team, we want to use containers for a cloud-project. To minimize the container-size &ability, we would like to disable for some, the benchmark-test and compiling ability. Could you give us a recommendation to do so? We would like to use a debian-version(instead of centos) in the container. As far as I understood with <--disable-x> the compiling ability can be reduced, e.g. <--disable-fortran> for fortran. Is there an option to disable all? What is the option to disable the osu-benchmarks? Would you recommend the package or for any cloud? Sincerely Joachim Tscheuschner _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. -------------- next part -------------- A non-text attachment was scrubbed... Name: install.mvapich.sh Type: application/octet-stream Size: 2324 bytes Desc: install.mvapich.sh URL: From panda at cse.ohio-state.edu Fri Jan 21 09:44:53 2022 From: panda at cse.ohio-state.edu (Panda, Dhabaleswar) Date: Fri, 21 Jan 2022 14:44:53 +0000 Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark In-Reply-To: <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> References: <4f75d340469849ad91da3681e40b4bea@exch.dwd.de> <7bc0af678f3d41dc84b7d20d206320c9@exch.dwd.de> Message-ID: Hi Joachim, Since AWS HPC instances use EFA adapters (not IB adapters), we have a separate version of MVAPICH2 named `MVAPICH2-X-AWS'. This version is optimized for the AWS EFA adapters. Please use this version (you can download it from the MVAPICH2 download site) and follow the steps mentioned in the associated user guide. Let us know if you encounter any issues with this version. Thanks, DK -----Original Message----- From: Tscheuschner Joachim Sent: Friday, January 21, 2022 9:28 AM To: Shineman, Nat ; Tscheuschner Joachim Cc: mvapich-discuss at lists.osu.edu; Panda, Dhabaleswar ; Subramoni, Hari Subject: AW: Compiling / Instalation without osu-benchmark Hello Nat, we are building a portable container for different cloud-provider. While everything works in our system, testing with AWS-cloud shows a problem. 
From Joachim.Tscheuschner at dwd.de Wed Jan 26 03:42:40 2022
From: Joachim.Tscheuschner at dwd.de (Tscheuschner Joachim)
Date: Wed, 26 Jan 2022 08:42:40 +0000
Subject: [Mvapich-discuss] Compiling / Instalation without osu-benchmark - ch3:nemesis:ib, tcp
Message-ID: <9c2b6bd4573f4a1ba5b5f88f69090e65@exch.dwd.de>

Hi, I am aware of the fact that there is the possibility to use the AWS version; however, we just provide the container and the customers choose the cloud. To minimize the effort we want to use the normal version, and in this case the use of tcp for AWS would be enough (at the moment). However, the question (compare 4.13, Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)) of why ./configure --with-device=ch3:nemesis:ib,tcp does not compile (note nemesis:ib is deprecated): error: 'MPIDI_CH3I_CH_comm_t' {aka 'struct '} has no member named 'rank_list' ..... remains. Am I missing any flags or prerequisites?

Cheers
Joachim
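For reference, the two build configurations being contrasted in this thread, collected in one place (the prefixes are placeholders; the outcomes in the comments are the ones reported above, not general statements):

  # TCP-only Nemesis build: runs on AWS, but the container gives up native IB elsewhere
  ./configure --with-device=ch3:nemesis:tcp --prefix=$HOME/mvapich2-tcp
  # unified IB+TCP Nemesis build (user guide section 4.13): fails to compile here with
  #   error: 'MPIDI_CH3I_CH_comm_t' has no member named 'rank_list'
  ./configure --with-device=ch3:nemesis:ib,tcp --prefix=$HOME/mvapich2-unified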
Der Originalabsender dieser E-Mail ist: shineman.5 at osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. ------------------------------ Subject: Digest Footer _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://urldefense.com/v3/__https://ofcsg2dvf1.dwd.de/fmlurlsvc/?fewReq=:B:JVMxOj48MS19NjklOy1ibzY7OjE7Oi14Ymxlan9*eW42OztuP245Mz85ajhoM2hqb2k5Mjw8M288PD44Om86aDk4b28zaDxuaC1*Njo9Pzk8PD04OzstemJvNjk7R05hO0o9Ozs5PDI8Jjk7R05hO0o8Ozs5PDI8LXloe382QWRqaGNiZiVfeGhjbn54aGNlbnlLb3xvJW9uLWg2ODstY29nNjs=&url=https*3a*2f*2flists.osu.edu*2fmailman*2flistinfo*2fmvapich-discuss__;Ky8lJSUlJSU!!KGKeukY!m0J72VA1dX6qtl644J8PztmG1c3Tw55XacP4zrawbPpxohkN10OJ1P_6Tn5bUAbQ8xznDRXV8g$ ------------------------------ End of Mvapich-discuss Digest, Vol 14, Issue 8 ********************************************** ---------- Hinweis (DWD): Mindestens eine URL in dieser E-Mail wurde vom Schadsoftware-Erkennungssystem als potentiell gef?hrlich eingestuft und so umgeschrieben, dass ein Klick darauf diese zun?chst auf gef?hrliche Downloads hin untersucht, bevor Inhalte angezeigt werden. Die Analyse erfolgt ohne menschliches Zutun voll automatisiert. Der Datenschutz ist gew?hrleistet. Es kann einige Sekunden dauern, bis die Inhalte angezeigt werden. Der Originalabsender dieser E-Mail ist: mvapich-discuss-bounces at lists.osu.edu. Wenn Ihnen dieser unbekannt ist, so sollten Sie die enthaltenen Links nicht anklicken. From ngagnon at rockportnetworks.com Wed Jan 26 10:40:18 2022 From: ngagnon at rockportnetworks.com (Nicolas Gagnon) Date: Wed, 26 Jan 2022 15:40:18 +0000 Subject: [Mvapich-discuss] FW: OSU_alltoall fail to complete when servers have extra ConnectX-5 In-Reply-To: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> References: <2281DD6B-DC09-48C0-A9C4-CEAB4414B956@rockportnetworks.com> Message-ID: <39E51ACB-D479-4292-BBE7-3B3230B97AA2@rockportnetworks.com> Good day OSU team, I?ve been debugging an issue with our system that just came up after installing new extra ConnectX-5 in some servers (12 out of 48) our system. When adding a single rank from a server hosting an extra CX-5 we start seeing problem as below: In this case, I?ve the disabled the PCIe slot hosting the extra CX-5 on all server except one and still getting the error. If I remove the offending server from the ?all_cards.cfg? host file, I can now use all host and the maximum number of ranks. The CX5 were added a month ago and I initially suspected I did make a mistake the way I build the latest code, but I?ve tried multiple versions release to Rockport and still getting into this state. Depending on the version been used I?m getting different error (which is strange). The problem starts exhibiting when the cables were added to the CX-5 and the switches. Unless I disabled the PCIe slot hosting the extra card I cannot run the simple test below. Removing the ?MV2_HOMOGENOUS_CLUSTER=1? do not make any difference and explicitly specifying the ?MV2_IBA_HCA=mlx5_0? doesn?t help either. Note: The interfaces were not configured at this state, and I have not used the card at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn?t find any info in the User?s Guide related to the problem I?m seeing. Likely a configuration issue at my end but I don?t know what I?m missing here. 
Unfortunately, I had to disable all cards for running test but if required I can reconfigure some of the server to reproduce the problem and capture extra information. I did test using the official mvapich2-2.3.6, and the 3 drops we received for Rockport. /opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400 -env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100 ssh: connect to host 172.20.141.148 port 22: No route to host ^C[mpiexec at dell-s13-h1] Sending Ctrl-C to processes as requested [mpiexec at dell-s13-h1] Press Ctrl-C again to force abort [mpiexec at dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor) [mpiexec at dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy [mpiexec at dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream [mpiexec at dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion [user at dell-s13-h1 Regards, Nicolas Gagnon Principal Designer/Architect, Engineering ngagnon at rockportnetworks.com Rockport | Simplify the Network [signature_849490256] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 6093 bytes Desc: image001.png URL: From daniel.pou at hpe.com Fri Jan 28 13:20:26 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 18:20:26 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Message-ID: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan From subramoni.1 at osu.edu Fri Jan 28 13:42:17 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 18:42:17 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: Hi, Dan. We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. RPMs for MOFED 5.x should work with MOFED 5.4. Best, Hari. 
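As a quick sanity check related to the compatibility questions above, the CUDA and verbs libraries a given MVAPICH2-GDR build actually depends on can be listed from its shared library; the install path below is a placeholder:

  readelf -d /opt/mvapich2-gdr/lib/libmpi.so | grep NEEDED
  # or, resolving against the loader on the target node:
  ldd /opt/mvapich2-gdr/lib/libmpi.so | grep -i -E 'cuda|ibverbs'

This only shows which libcudart/libcuda and libibverbs versions are picked up; it does not by itself establish that a build made against one CUDA minor release behaves correctly under another.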
-----Original Message----- From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss Sent: Friday, January 28, 2022 1:20 PM To: mvapich-discuss at lists.osu.edu Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? Thanks, -Dan _______________________________________________ Mvapich-discuss mailing list Mvapich-discuss at lists.osu.edu https://lists.osu.edu/mailman/listinfo/mvapich-discuss From subramoni.1 at osu.edu Fri Jan 28 14:20:02 2022 From: subramoni.1 at osu.edu (Subramoni, Hari) Date: Fri, 28 Jan 2022 19:20:02 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: Hi, Dan. Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. Thx, Hari. -----Original Message----- From: Pou, Dan Sent: Friday, January 28, 2022 2:11 PM To: Subramoni, Hari Cc: mvapich-discuss at lists.osu.edu Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss > On Behalf >Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? 
> >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:10:55 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:10:55 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> Message-ID: <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. Thanks, -Dan On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >Hi, Dan. > >We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. > >Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. > >RPMs for MOFED 5.x should work with MOFED 5.4. > >Best, >Hari. > >-----Original Message----- >From: Mvapich-discuss On Behalf Of Pou, Dan via Mvapich-discuss >Sent: Friday, January 28, 2022 1:20 PM >To: mvapich-discuss at lists.osu.edu >Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? > >Thanks, >-Dan >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss >_______________________________________________ >Mvapich-discuss mailing list >Mvapich-discuss at lists.osu.edu >https://lists.osu.edu/mailman/listinfo/mvapich-discuss From daniel.pou at hpe.com Fri Jan 28 14:38:03 2022 From: daniel.pou at hpe.com (Pou, Dan) Date: Fri, 28 Jan 2022 19:38:03 +0000 Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs In-Reply-To: References: <20220128182025.npc7gbc3pfk32wgp@C02Z72C9LVDN> <20220128191054.de65bfm3pdn3ox5l@C02Z72C9LVDN> Message-ID: <20220128193802.ooogvc4hm4h3rfa3@C02Z72C9LVDN> We are just using the RHEL8 GCC GDR binaries. I think we prefer 8.1 GCC. We always try to support the latest public releases of CUDA, and prefer to have both Slurm and non-Slurm/PBS versions. I think we expect customers to have moved to OFED 5.1+. Thank you for the quick responses. I wish I was able to make it easier to support CCE bindings for MVAPICH2, but the GNU Libtool patches are a challenge, especially in regard to the Fortran frontend. Cheers, -Dan On Fri, Jan 28, 2022 at 07:20:02PM +0000, Subramoni, Hari wrote: >Hi, Dan. 
> >Please let us know any specific RPMs you would be interested. We can build them for you. We are working on building the ones you'd requested for as we speak. > >Unfortunately, we are still rolling out RHEL8 and newer CUDA versions internally. Once that process is complete, we will be adding newer RPMs to our download page. > >Thx, >Hari. > >-----Original Message----- >From: Pou, Dan >Sent: Friday, January 28, 2022 2:11 PM >To: Subramoni, Hari >Cc: mvapich-discuss at lists.osu.edu >Subject: Re: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs > >Thanks. This is a bummer, as I have a request in for 11.4 RPMs, but it seems unlikely to get them by Monday, which would be the timeline to fit our release schedule. >It would be greatly appreciated if GDR could keep up with latest CUDA minor releases for RHEL8 X86_64 Slurm/PBS, but I understand there are a lot of builds to run. I am surprised that you still seem to have a lot of RHEL7/OFED 4.x support. > >Thanks, >-Dan > > >On Fri, Jan 28, 2022 at 06:42:17PM +0000, Subramoni, Hari via Mvapich-discuss wrote: >>Hi, Dan. >> >>We are working towards making MVAPICH2 independent of the underlying CUDA libraries. It should be available with one of our future releases. >> >>Due to changes in the way CUDA implements some underlying features RPMs for CUDA 11.2 and lower are not compatible with CUDA 11.3 and higher. >> >>RPMs for MOFED 5.x should work with MOFED 5.4. >> >>Best, >>Hari. >> >>-----Original Message----- >>From: Mvapich-discuss >> On Behalf >>Of Pou, Dan via Mvapich-discuss >>Sent: Friday, January 28, 2022 1:20 PM >>To: mvapich-discuss at lists.osu.edu >>Subject: [Mvapich-discuss] GDR CUDA 11.4 X86_64 RPMs >> >>I was curious about the sensitivity for MVAPICH2-GDR to MLNX OFED and CUDA versions. I know that the base open source MVAPICH2 made efforts to reduce dependence on particular versions of OFED. I also see that there is only DT_NEEDED on libcudart.so.11.0 and libcuda.so.1, since it looks like Nvidia doesn't has not incremented any symbol versions. >>Are there any known issues using a MVPICH2-GDR 11.2 with 11.4? >>Are there any known issues using a MVPICH2-GDR OFED 5.x vs 5.4? >> >>Thanks, >>-Dan >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss >>_______________________________________________ >>Mvapich-discuss mailing list >>Mvapich-discuss at lists.osu.edu >>https://lists.osu.edu/mailman/listinfo/mvapich-discuss