[Mvapich-discuss] 2 Issues with data validation on OSU 5.9
Goldman, Adam
adam.goldman at intel.com
Wed May 18 17:29:23 EDT 2022
Hello,
I am running OSU 5.9 with data validation and have noticed 2 issues:
1) Running "osu_multi_lat" with a high rank count per node results in 'Out of Memory' failures:
Configuration:
48 ranks/node * 4 nodes (192 ranks total)
Running over OMPI with OFI (psm3 provider).
Args: "-c"
Mem Size: 64GB/node
ERROR (Dmesg):
[107540.289787] Out of memory: Killed process 114599 (osu_multi_lat) total-vm:2278092kB, anon-rss:1636984kB, file-rss:0kB, shmem-rss:1644kB, UID:0 pgtables:4236kB oom_score_adj:0
[107540.456582] oom_reaper: reaped process 114599 (osu_multi_lat), now anon-rss:0kB, file-rss:0kB, shmem-rss:1644kB
This was easily repeatable; however, if I started at message size 524288 ("-m 524288:"), the run got slightly further (two more message sizes) before hitting the OOM.
I think there might be a memory leak in the data validation path.
Without data validation, the run stays well under half of the total memory.
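For rough scale (assuming the other ranks on the node were at a similar size when the OOM hit): the dmesg line shows ~1.6 GB of anon RSS for the one killed rank, and 48 ranks * ~1.6 GB is roughly 78 GB, already past the 64 GB on the node. The send/receive and validation buffers at the default maximum message size (4 MB, if I recall correctly) should only account for a few tens of MB per rank, so most of that memory is unaccounted for.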
2) Running the pt2pt benchmarks on CUDA with args "H D" or "D H" does not work
Configuration:
1 rank/node * 2 nodes (2 ranks total)
Running over CUDA enabled OMPI with OFI (psm3 provider).
Args: "<OSU> -c [DST] [SRC]"
ERROR: (osu_bibw -c D H)
# OSU MPI-CUDA Bi-Directional Bandwidth Test v5.9
# Send Buffer on DEVICE (D) and Receive Buffer on HOST (H)
# Size Bandwidth (MB/s) Validation
[../../util/osu_util_mpi.c:940] CUDA call 'cudaMemcpy((void *)s_buf, (void *)temp_s_buffer, size, cudaMemcpyHostToDevice)' failed with 1: invalid argument
This looks to be repeatable on all pt2pt benchmarks.
A quick look at the code shows that it does not check where the src and dst buffers reside before calling memcpy/cudaMemcpy.
"Managed" buffers (MH and MD) are also not handled correctly and seem to produce false validation errors.
Regards,
Adam Goldman
Intel Corporation
adam.goldman at intel.com