[Mvapich-discuss] 2 Issues with data validation on OSU 5.9

Subramoni, Hari subramoni.1 at osu.edu
Wed May 18 18:10:39 EDT 2022


Hi, Adam.

Thanks for the report. Sorry to hear that you’re facing issues.

We will take a look at this and get back to you shortly.

Thx,
Hari.

From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> On Behalf Of Goldman, Adam via Mvapich-discuss
Sent: Wednesday, May 18, 2022 5:29 PM
To: mvapich-discuss at lists.osu.edu
Cc: Heinz, Michael <michael.heinz at intel.com>; Wan, Kaike <kaike.wan at intel.com>
Subject: [Mvapich-discuss] 2 Issues with data validation on OSU 5.9


Hello,



I am running OSU 5.9 with data validation and have noticed 2 issues:



1) Running "osu_multi_lat" with a high number of ranks per node results in 'Out of Memory' failures:



Configuration:

      48 ranks/node * 4 nodes (192 ranks total)

      Running over OMPI with OFI (psm3 provider).

      Args: "-c"

      Mem Size: 64GB/node



ERROR (Dmesg):

      [107540.289787] Out of memory: Killed process 114599 (osu_multi_lat) total-vm:2278092kB, anon-rss:1636984kB, file-rss:0kB, shmem-rss:1644kB, UID:0 pgtables:4236kB oom_score_adj:0

      [107540.456582] oom_reaper: reaped process 114599 (osu_multi_lat), now anon-rss:0kB, file-rss:0kB, shmem-rss:1644kB



This was easily repeatable; however, if I started at message size 524288 ("-m 524288:") I could get slightly further (two more message sizes) before the OOM kill.

I suspect there is a memory leak in the data validation path.



Without data validation, the same run uses less than half of the total memory.
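To illustrate the kind of behavior I mean, here is a minimal sketch (not the actual OSU code; the function name and expected contents are made up) of a per-call allocation pattern in a validation routine. If a temporary "expected" buffer were allocated on every validation call and never freed, each rank would leak memory proportional to the message sizes it validates, which at 48 ranks/node would add up quickly:

      /* Illustrative sketch only -- not the actual OSU validation code. */
      #include <stdlib.h>
      #include <string.h>

      static int validate_recv_buf(const char *buf, size_t size)
      {
          /* Temporary buffer holding the expected contents. */
          char *expected = malloc(size);
          if (!expected)
              return -1;
          memset(expected, 'a', size);

          int ok = (memcmp(buf, expected, size) == 0);

          /* If this free() were missing, every validated message size would
           * leak 'size' bytes per rank -- consistent with OOM kills that
           * only appear when -c is enabled. */
          free(expected);
          return ok ? 0 : -1;
      }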





2) Running pt2pt benchmarks on CUDA with buffer args "H D" or "D H" does not work



Configuration:

1 rank/node * 2 nodes (2 ranks total)

      Running over CUDA enabled OMPI with OFI (psm3 provider).

      Args: "<OSU> -c [DST] [SRC]"



ERROR: (osu_bibw -c D H)

      # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.9

      # Send Buffer on DEVICE (D) and Receive Buffer on HOST (H)

      # Size      Bandwidth (MB/s)        Validation

      [../../util/osu_util_mpi.c:940] CUDA call 'cudaMemcpy((void *)s_buf, (void *)temp_s_buffer, size, cudaMemcpyHostToDevice)' failed with 1: invalid argument



This appears to be reproducible on all pt2pt benchmarks.



A quick look at the code shows that the validation path does not check whether the src and dst buffers reside on the host or the device before calling memcpy/cudaMemcpy.

"Managed" buffers (MH and MD) are also not handled correctly and seem to report false validation errors.
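For reference, here is a minimal sketch of how the staging copy could pick the right copy mechanism based on where the destination buffer actually lives (the helper name and structure are mine, not the OSU code). It queries the pointer with cudaPointerGetAttributes; only true device allocations need cudaMemcpy, while host and managed buffers can take a plain memcpy:

      /* Illustrative helper (not part of OSU): copy validation data into a
       * benchmark buffer without assuming where that buffer lives. */
      #include <cuda_runtime.h>
      #include <string.h>

      static void copy_to_bench_buf(void *dst, const void *src, size_t size)
      {
          struct cudaPointerAttributes attr;

          if (cudaPointerGetAttributes(&attr, dst) == cudaSuccess &&
              attr.type == cudaMemoryTypeDevice) {
              /* Destination really is a device allocation. */
              cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
          } else {
              /* Host, managed, or unregistered memory: a plain memcpy is
               * valid and avoids the 'invalid argument' failure above. */
              cudaGetLastError();  /* clear any error left by the query */
              memcpy(dst, src, size);
          }
      }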



Regards,



Adam Goldman

Intel Corporation

adam.goldman at intel.com


