[Mvapich-discuss] Memory Leak in CUDA-aware allgather MVAPICH-GDR (2.3.7)
Chen, Chen Chun
chen.10252 at buckeyemail.osu.edu
Fri Jan 26 15:15:48 EST 2024
Hi Botao,
Thank you for bringing this issue to our attention. The current Allgather algorithm may exhaust the device memory.
To address this, please try an alternative algorithm by setting MV2_INTER_ALLGATHER_TUNING=3.
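For example, you can export it before launching the job, alongside your other environment settings (shown here in the same style as your module/export list):

# Select an alternative inter-node allgather algorithm
export MV2_INTER_ALLGATHER_TUNING=3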
I hope this resolves your concern, and don't hesitate to reach out if you have any further questions.
Thanks,
Chen-Chun
From: Mvapich-discuss <mvapich-discuss-bounces+chen.10252=osu.edu at lists.osu.edu> on behalf of Wu, Botao via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Date: Wednesday, January 24, 2024 at 2:41 PM
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Subject: [Mvapich-discuss] Memory Leak in CUDA-aware allgather MVAPICH-GDR (2.3.7)
Hi,
I'm working on a project that needs CUDA-aware MPI, and I ran into a problem: the allgather function in MVAPICH-GDR (2.3.7) uses too much GPU memory (VRAM).
I'm using the Pitzer cluster at the Ohio Supercomputer Center, on 1 node with 2 GPUs (no NVLink) and 2 MPI ranks in total, one rank per GPU. The program runs out of GPU memory very quickly and crashes.
I've attached a small program (with code and script) that can reproduce the problem. The attached screenshot highlights the main body of the code.
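For reference, a minimal sketch of this kind of reproducer (repeated MPI_Allgather on cudaMalloc'd device buffers, one GPU per rank; the buffer size and iteration count here are illustrative, not the exact values from the attached code):

/* Sketch of a CUDA-aware MPI_Allgather reproducer (illustrative values). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(rank);               /* one GPU per rank on the node */

    const size_t count = 1 << 20;      /* floats contributed per rank (assumed) */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, count * sizeof(float));
    cudaMalloc((void **)&d_recv, (size_t)size * count * sizeof(float));

    /* Call allgather repeatedly on device buffers; GPU memory use
       keeps growing with the default algorithm until the run crashes. */
    for (int i = 0; i < 1000; i++) {
        MPI_Allgather(d_send, count, MPI_FLOAT,
                      d_recv, count, MPI_FLOAT, MPI_COMM_WORLD);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}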
I would be grateful for any information you can provide.
Thanks,
Botao
Software version:
module load intel/2021.3.0
module load cmake
module load python
module load mkl
module load mvapich2-gdr/2.3.7
module load cuda/11.6.1
export LD_PRELOAD=/apps/mvapich2-gdr/intel/2021.3/2.3.7/lib64/libmpi.so