[Mvapich-discuss] 2 Issues with data validation on OSU 5.9

Goldman, Adam adam.goldman at intel.com
Wed May 18 17:29:23 EDT 2022


Hello,



I am running OSU 5.9 with data validation and have noticed 2 issues:



1) Running "osu_multi_lat" with a high number of ranks per node results in 'Out of Memory' failures:



Configuration:

      48 ranks/node * 4 nodes (192 ranks total)

      Running over OMPI with OFI (psm3 provider).

      Args: "-c"

      Mem Size: 64GB/node



ERROR (Dmesg):

      [107540.289787] Out of memory: Killed process 114599 (osu_multi_lat) total-vm:2278092kB, anon-rss:1636984kB, file-rss:0kB, shmem-rss:1644kB, UID:0 pgtables:4236kB oom_score_adj:0

      [107540.456582] oom_reaper: reaped process 114599 (osu_multi_lat), now anon-rss:0kB, file-rss:0kB, shmem-rss:1644kB



This is easily repeatable. However, when I start at message size 524288 ("-m 524288:"), the run gets a bit further (2 more message sizes).

I think there might be a memory leak with data validation.



Without data validation, the same run uses less than half of the total memory.





2) Running the pt2pt benchmarks on CUDA with args "H D" or "D H" fails



Configuration:

      1 rank/node * 2 nodes (2 ranks total)

      Running over CUDA enabled OMPI with OFI (psm3 provider).

      Args: "<OSU> -c [DST] [SRC]"



ERROR: (osu_bibw -c D H)

      # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.9

      # Send Buffer on DEVICE (D) and Receive Buffer on HOST (H)

      # Size      Bandwidth (MB/s)        Validation

      [../../util/osu_util_mpi.c:940] CUDA call 'cudaMemcpy((void *)s_buf, (void *)temp_s_buffer, size, cudaMemcpyHostToDevice)' failed with 1: invalid argument



Looks to be repeatable on all pt2pt benchmarks.



A quick look at the code shows that we do not check where the src and dst buffers reside (host or device) before calling memcpy/cudaMemcpy.

"Managed" buffers (MH and MD) are also not handled correctly and seem to report false errors on validation.



Regards,



Adam Goldman

Intel Corporation

adam.goldman at intel.com
