[Mvapich-discuss] Memory Leak in CUDA-aware allgather MVAPICH-GDR (2.3.7)
Chen, Chen Chun
chen.10252 at buckeyemail.osu.edu
Fri Jan 26 15:15:48 EST 2024
Hi Botao,
Thank you for bringing this issue to our attention. The current Allgather algorithm may exhaust the device memory.
To address this, please try an alternative algorithm by setting MV2_INTER_ALLGATHER_TUNING=3.
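For example, you can export it before launching the job, alongside your other environment settings (shown here in the same style as your module/export list):

# Select an alternative inter-node allgather algorithm
export MV2_INTER_ALLGATHER_TUNING=3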
I hope this resolves your concern, and don't hesitate to reach out if you have any further questions.
Thanks,
Chen-Chun
From: Mvapich-discuss <mvapich-discuss-bounces+chen.10252=osu.edu at lists.osu.edu> on behalf of Wu, Botao via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Date: Wednesday, January 24, 2024 at 2:41 PM
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Subject: [Mvapich-discuss] Memory Leak in CUDA-aware allgather MVAPICH-GDR (2.3.7)
Hi,
I'm working on a project that needs CUDA-aware MPI, and I ran into a problem: the allgather function in MVAPICH-GDR (2.3.7) uses too much GPU memory (VRAM).
I'm using the Pitzer cluster at the Ohio Supercomputer Center, on 1 node with 2 GPUs (no NVLink) and 2 MPI ranks in total, one rank per GPU. The program runs out of GPU memory very quickly and crashes.
I've attached a small program (with code and script) that can reproduce the problem. The attached screenshot highlights the main body of the code.
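For reference, a minimal sketch of this kind of reproducer (repeated MPI_Allgather on cudaMalloc'd device buffers, one GPU per rank; the buffer size and iteration count here are illustrative, not the exact values from the attached code):

/* Sketch of a CUDA-aware MPI_Allgather reproducer (illustrative values). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(rank);               /* one GPU per rank on the node */

    const size_t count = 1 << 20;      /* floats contributed per rank (assumed) */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, count * sizeof(float));
    cudaMalloc((void **)&d_recv, (size_t)size * count * sizeof(float));

    /* Call allgather repeatedly on device buffers; GPU memory use
       keeps growing with the default algorithm until the run crashes. */
    for (int i = 0; i < 1000; i++) {
        MPI_Allgather(d_send, count, MPI_FLOAT,
                      d_recv, count, MPI_FLOAT, MPI_COMM_WORLD);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}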
I would be grateful for any information you can provide.
Thanks,
Botao
Software version:
module load intel/2021.3.0
module load cmake
module load python
module load mkl
module load mvapich2-gdr/2.3.7
module load cuda/11.6.1
export LD_PRELOAD=/apps/mvapich2-gdr/intel/2021.3/2.3.7/lib64/libmpi.so