[Mvapich-discuss] 2 Issues with data validation on OSU 5.9
Subramoni, Hari
subramoni.1 at osu.edu
Wed May 18 18:10:39 EDT 2022
Hi, Adam.
Thanks for the report. Sorry to hear that you’re facing issues.
We will take a look at this and get back to you shortly.
Thx,
Hari.
From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> On Behalf Of Goldman, Adam via Mvapich-discuss
Sent: Wednesday, May 18, 2022 5:29 PM
To: mvapich-discuss at lists.osu.edu
Cc: Heinz, Michael <michael.heinz at intel.com>; Wan, Kaike <kaike.wan at intel.com>
Subject: [Mvapich-discuss] 2 Issues with data validation on OSU 5.9
Hello,
I am running OSU 5.9 with data validation and have noticed 2 issues:
1) Running with high ranks/node on "osu_multi_lat" will result in 'Out of Memory' failures:
Configuration:
48 ranks/node * 4 nodes (192 ranks total)
Running over OMPI with OFI (psm3 provider).
Args: "-c"
Mem Size: 64GB/node
ERROR (Dmesg):
[107540.289787] Out of memory: Killed process 114599 (osu_multi_lat) total-vm:2278092kB, anon-rss:1636984kB, file-rss:0kB, shmem-rss:1644kB, UID:0 pgtables:4236kB oom_score_adj:0
[107540.456582] oom_reaper: reaped process 114599 (osu_multi_lat), now anon-rss:0kB, file-rss:0kB, shmem-rss:1644kB
This was easily repeatable; however, if I started at message size 524288 ("-m 524288:") I could get a bit further (two more message sizes).
I suspect there is a memory leak in the data-validation path:
without data validation, the benchmark uses less than half as much memory.
2) Running pt2pt benchmarks on CUDA with args "H D" or "D H" fails:
Configuration:
1 rank/node * 2 nodes (2 ranks total)
Running over CUDA enabled OMPI with OFI (psm3 provider).
Args: "<OSU> -c [DST] [SRC]"
ERROR: (osu_bibw -c D H)
# OSU MPI-CUDA Bi-Directional Bandwidth Test v5.9
# Send Buffer on DEVICE (D) and Receive Buffer on HOST (H)
# Size Bandwidth (MB/s) Validation
[../../util/osu_util_mpi.c:940] CUDA call 'cudaMemcpy((void *)s_buf, (void *)temp_s_buffer, size, cudaMemcpyHostToDevice)' failed with 1: invalid argument
This looks repeatable across all pt2pt benchmarks.
A quick look at the code shows that the validation path does not check where the src and dst buffers live before calling memcpy/cudaMemcpy.
"Managed" buffers (MH and MD) are also not handled correctly and appear to report false validation errors.
Regards,
Adam Goldman
Intel Corporation
adam.goldman at intel.com<mailto:adam.goldman at intel.com>