[mvapich-discuss] Memory leaks were detected when checking with cuda-memcheck

Subramoni, Hari subramoni.1 at osu.edu
Thu Oct 18 07:27:48 EDT 2018


Dear Yussuf,

Can you please try setting MV2_CUDA_ENABLE_IPC_CACHE=0 instead of MV2_CUDA_IPC=0 and see if it solves the issue?
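
For example (the launcher arguments below are only placeholders for your actual run command), exporting the variable on the command line should be enough:

MV2_USE_CUDA=1 MV2_CUDA_ENABLE_IPC_CACHE=0 mpiexec -np 2 ./a.out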

Thx,
Hari.

-----Original Message-----
From: Yussuf Ali <yussuf.ali at jaea.go.jp> 
Sent: Thursday, October 18, 2018 2:07 AM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: RE: [mvapich-discuss] Memory leaks were detected when checking with cuda-memcheck

Dear MVAPICH developers and users,

We tested the new 2.3rc1 version, but the result is the same. The background of the cuda-memcheck testing is that we have a large application which produces wrong results after some time when we use MVAPICH2-GDR 2.3a.
By "after some time" I mean that the output is initially correct, but later the values appear to be uninitialized.

We tried to track it down; so far we have found:
1.) When we use OpenMPI (utilizing NVLink), the application runs correctly until the end.
2.) When we set MV2_CUDA_IPC=0, the application runs correctly with MVAPICH2-GDR until the end.
3.) When we manually stage the data through host memory, the application runs correctly with MVAPICH2-GDR until the end (see the sketch below for what we mean by staging).
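
To illustrate item 3.), the staging workaround corresponds roughly to the sketch below. All names and sizes are made up for illustration; the real code is more involved.

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Staging through host memory: device buffers are copied to host buffers
 * before communication and back afterwards, so MPI never sees device
 * pointers. Illustrative sketch only. */
static void exchange_staged(double *d_send, double *d_recv, int n,
                            int peer, MPI_Comm comm)
{
    double *h_send = (double *)malloc(n * sizeof(double));
    double *h_recv = (double *)malloc(n * sizeof(double));

    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send, n, MPI_DOUBLE, peer, 0,
                 h_recv, n, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);

    free(h_send);
    free(h_recv);
}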

The structure of the program is as follows:

MPI_Init()
for(i to max)
{
    cudaMalloc(all GPU memory)
    for(k to max1)
    {
        for(1 to s)
        {
            exchange data with up to six other ranks using Isend and Irecv
        }
        Allreduce()
        Allreduce()
    }
    cudaFree(all GPU memory)
}
MPI_Finalize()
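
In plain C/MPI terms, one outer iteration corresponds roughly to the sketch below. All names, sizes, and the neighbor list are made up for illustration; the real application is much larger.

#include <mpi.h>
#include <cuda_runtime.h>

/* One outer iteration: allocate device memory, run many exchange +
 * Allreduce steps, then free the device memory again (illustrative only). */
static void outer_iteration(int nsteps, int nneigh, const int *neigh,
                            int nelem, MPI_Comm comm)
{
    double *sbuf, *rbuf;
    cudaMalloc((void **)&sbuf, (size_t)nelem * sizeof(double));
    cudaMalloc((void **)&rbuf, (size_t)nneigh * nelem * sizeof(double));

    for (int k = 0; k < nsteps; ++k) {
        MPI_Request req[12];   /* up to six neighbors, one send + one recv each */
        int nreq = 0;
        for (int n = 0; n < nneigh; ++n) {
            MPI_Irecv(rbuf + (size_t)n * nelem, nelem, MPI_DOUBLE,
                      neigh[n], 0, comm, &req[nreq++]);
            MPI_Isend(sbuf, nelem, MPI_DOUBLE,
                      neigh[n], 0, comm, &req[nreq++]);
        }
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

        double local = 0.0, global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, comm);
    }

    cudaFree(sbuf);   /* all device memory is released before the next outer iteration */
    cudaFree(rbuf);
}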

When all cudaFree calls are replaced by a single cudaDeviceReset(), in order to make sure that all memory is freed before the next iteration starts, the following error messages are generated at the start of the next iteration:

[x002:mpi_rank_1][MPIDI_CH3I_MRAIL_Rndv_transfer_cuda_ipc] src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:368: cudaIpcGetEventHandle failed: File exists (17)
[x002:mpi_rank_3][MPIDI_CH3I_MRAIL_Rndv_transfer_cuda_ipc] src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:368: cudaIpcGetEventHandle failed: File exists (17)
[x002:mpi_rank_0][MPIDI_CH3I_MRAIL_Rndv_transfer_cuda_ipc] src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:368: cudaIpcGetEventHandle failed: File exists (17)
[x002:mpi_rank_2][MPIDI_CH3I_MRAIL_Rndv_transfer_cuda_ipc] src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_rndv.c:368: cudaIpcGetEventHandle failed: File exists (17)

We have two questions. Is there a flag we can set to get more output from MVAPICH so we can see what is going on behind the scenes?
And is it allowed to call cudaDeviceReset() between MPI_Init() and MPI_Finalize(), or does it free internal MVAPICH data structures which are used for management?


Thank you for your help,
Yussuf

-----Original Message-----
From: Subramoni, Hari [mailto:subramoni.1 at osu.edu]
Sent: Wednesday, October 10, 2018 7:59 PM
To: Yussuf Ali <yussuf.ali at jaea.go.jp>; mvapich-discuss at cse.ohio-state.edu
<mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Subramoni, Hari <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] Memory leaks were detected when checking with cuda-memcheck

Dear Yussuf,

Thanks a lot for the report. We appreciate it.

We recently released MVAPICH2-GDR 2.3rc1. Could you please check whether some of these issues have been resolved with it?

In parallel, we will look into the issues internally using the cuda-memcheck tool you mentioned.

Best Regards,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Yussuf Ali
Sent: Wednesday, October 10, 2018 12:28 AM
To: mvapich-discuss at cse.ohio-state.edu
<mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Memory leaks were detected when checking with cuda-memcheck

Dear MVAPICH-GDR developers and users,

We used the cuda-memcheck tool to check for memory leaks in a small example program which uses MVAPICH2-GDR 2.3a.
At the end of the program several leaks were detected; however, we are not sure where these leaks come from. Is something wrong with this MPI program?

We set MV2_USE_CUDA=1, and the program is executed with the following command using two MPI processes:
mpiexec cuda-memcheck --leak-check full ./a.out

According to the NVIDIA documentation, cudaDeviceReset() is necessary in order to detect memory leaks, so this function call was inserted after MPI_Finalize().

#include <stdio.h>
#include <cuda_runtime.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaSetDevice(rank);

  /* One double of device memory for sending, one for receiving. */
  double *s_buf;
  double *r_buf;
  cudaMalloc((void **)&s_buf, sizeof(double));
  cudaMalloc((void **)&r_buf, sizeof(double));

  /* With two processes, each rank exchanges one value with the other. */
  int dst = (rank == 0) ? 1 : 0;
  double data = (rank == 0) ? 0.0 : 1.0;
  printf("Rank: %i Data is: %f \n", rank, data);

  cudaMemcpy(s_buf, &data, sizeof(double), cudaMemcpyHostToDevice);
  MPI_Request req[2];
  MPI_Irecv(r_buf, 1, MPI_DOUBLE, dst, 123, MPI_COMM_WORLD, &req[0]);
  MPI_Isend(s_buf, 1, MPI_DOUBLE, dst, 123, MPI_COMM_WORLD, &req[1]);

  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

  /* Copy the received value back to the host and print it. */
  double check;
  cudaMemcpy(&check, r_buf, sizeof(double), cudaMemcpyDeviceToHost);
  MPI_Barrier(MPI_COMM_WORLD);
  printf("Rank: %i Received: %f \n", rank, check);

  cudaFree(s_buf);
  cudaFree(r_buf);
  cudaDeviceSynchronize();
  MPI_Finalize();
  cudaDeviceReset();  /* required so cuda-memcheck can report unfreed allocations */
  return 0;
}
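
For completeness: the example above does not check CUDA return codes. A minimal checking wrapper (our own sketch, nothing MVAPICH- or CUDA-specific) would look like the following.

#include <stdio.h>
#include <cuda_runtime.h>
#include "mpi.h"

/* Print the failing call with file and line, then abort all ranks. */
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "%s:%d: %s failed: %s\n",                       \
                    __FILE__, __LINE__, #call, cudaGetErrorString(err_));   \
            MPI_Abort(MPI_COMM_WORLD, 1);                                   \
        }                                                                   \
    } while (0)

/* Usage: CUDA_CHECK(cudaMalloc((void **)&s_buf, sizeof(double))); */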

The output of cuda-memcheck is:

========= CUDA-MEMCHECK
========= Leaked 524288 bytes at 0x7fff98d80000
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:/lib64/libcuda.so.1 (cuMemAlloc_v2 + 0x17f) [0x22bedf]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so [0x3cd0e0]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 [0x31b73]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 [0x10d7b]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 (cudaMalloc + 0x178) [0x42138]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (cudaipc_allocate_ipc_region + 0x119) [0x3c79e9]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (cuda_init_dynamic + 0x31d) [0x3c175d]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (is_device_buffer + 0x124) [0x3c19a4]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (MPID_Irecv + 0x1e0) [0x35a130]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (MPI_Irecv + 0x5a5) [0x2df4d5]
=========     Host Frame:./a.out [0x1171]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
=========     Host Frame:./a.out [0xf69]
=========
========= Leaked 524288 bytes at 0x7fff98d00000
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:/lib64/libcuda.so.1 (cuMemAlloc_v2 + 0x17f) [0x22bedf]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so [0x3cd0e0]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 [0x31b73]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 [0x10d7b]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 (cudaMalloc + 0x178) [0x42138]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (cudaipc_allocate_ipc_region + 0x119) [0x3c79e9]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (cuda_init_dynamic + 0x31d) [0x3c175d]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (is_device_buffer + 0x124) [0x3c19a4]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (MPID_Irecv + 0x1e0) [0x35a130]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (MPI_Irecv + 0x5a5) [0x2df4d5]
=========     Host Frame:./a.out [0x1171]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
=========     Host Frame:./a.out [0xf69]
=========
========= LEAK SUMMARY: 1048576 bytes leaked in 2 allocations
========= ERROR SUMMARY: 2 errors
========= CUDA-MEMCHECK
========= Leaked 524288 bytes at 0x7fff98c80000
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:/lib64/libcuda.so.1 (cuMemAlloc_v2 + 0x17f) [0x22bedf]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so [0x3cd0e0]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 [0x31b73]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 [0x10d7b]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 (cudaMalloc + 0x178) [0x42138]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (cudaipc_allocate_ipc_region + 0x119) [0x3c79e9]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (cuda_init_dynamic + 0x31d) [0x3c175d]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (is_device_buffer + 0x124) [0x3c19a4]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (MPID_Irecv + 0x1e0) [0x35a130]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (MPI_Irecv + 0x5a5) [0x2df4d5]
=========     Host Frame:./a.out [0x1171]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
=========     Host Frame:./a.out [0xf69]
=========
========= Leaked 524288 bytes at 0x7fff98c00000
=========     Saved host backtrace up to driver entry point at cudaMalloc time
=========     Host Frame:/lib64/libcuda.so.1 (cuMemAlloc_v2 + 0x17f) [0x22bedf]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so [0x3cd0e0]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 [0x31b73]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 [0x10d7b]
=========     Host Frame:/lustre/app/acc/cuda/9.0.176/lib64/libcudart.so.9.0 (cudaMalloc + 0x178) [0x42138]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (cudaipc_allocate_ipc_region + 0x119) [0x3c79e9]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (cuda_init_dynamic + 0x31d) [0x3c175d]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (is_device_buffer + 0x124) [0x3c19a4]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (MPID_Irecv + 0x1e0) [0x35a130]
=========     Host Frame:/lustre/app/mvapich2-gdr/ofed4.2/gnu/lib64/libmpi.so (MPI_Irecv + 0x5a5) [0x2df4d5]
=========     Host Frame:./a.out [0x1171]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
=========     Host Frame:./a.out [0xf69]
=========
========= LEAK SUMMARY: 1048576 bytes leaked in 2 allocations
========= ERROR SUMMARY: 2 errors

How can we resolve this issue?

Thank you for your help,
Yussuf







