[Mvapich-discuss] osu_bw segfault when running with CUDA accelerator and managed buffers
Shineman, Nat
shineman.5 at osu.edu
Mon Nov 29 13:45:27 EST 2021
Adam,
Sorry for the delay here. After some internal experimentation we discovered that OMB was not handling the --accelerator option correctly. This option should not be used with the pt2pt tests, since their two buffer types are already set with the H/D/MH/MD arguments. However, instead of being ignored as it should have been, the option was sending the benchmark down an alternate code path and causing the crash.
Please continue running your experiments without the --accelerator=cuda option and you should get the desired results. Our next release includes a fix that ignores this option for pt2pt tests, and we have added an acknowledgement to you in the Changelog.
As a side note, the M buffer option is deprecated. Please use either MH or MD (managed host and managed device respectively) instead.
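Putting both points together, the invocation from the original report would then look something like this (a sketch only; the hosts, MCA options, and benchmark path are taken verbatim from Adam's command, and MD is assumed to match his intent of testing managed device buffers):

```shell
# Drop --accelerator (broken for pt2pt tests; ignored in the next release)
# and replace the deprecated "M" buffer type with "MD" (managed device)
# for both the send and receive sides.
mpirun --mca mtl ofi -np 2 -H gpu01,gpu02 \
    ./osu-micro-benchmarks-5.8/mpi/pt2pt/osu_bw MD MD
```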
Please let me know if you have any questions.
Thanks,
Nat
________________________________
From: Mvapich-discuss <mvapich-discuss-bounces+shineman.5=osu.edu at lists.osu.edu> on behalf of Goldman, Adam via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Friday, November 12, 2021 09:10
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Cc: DAmbrosio, Cody J <cody.j.dambrosio at intel.com>; Rimmer, Todd <todd.rimmer at intel.com>; Bodner, Anton <anton.bodner at intel.com>
Subject: Re: [Mvapich-discuss] osu_bw segfault when running with CUDA accelerator and managed buffers
Hi, sending this again.
We are still seeing issues with the latest osu_bw benchmarks.
-Adam
From: Mvapich-discuss <mvapich-discuss-bounces+adam.goldman=intel.com at lists.osu.edu> On Behalf Of Goldman, Adam via Mvapich-discuss
Sent: Tuesday, November 2, 2021 11:04 AM
To: mvapich-discuss at lists.osu.edu
Cc: DAmbrosio, Cody J <cody.j.dambrosio at intel.com>; Bodner, Anton <anton.bodner at intel.com>; Rimmer, Todd <todd.rimmer at intel.com>
Subject: [Mvapich-discuss] osu_bw segfault when running with CUDA accelerator and managed buffers
Hello,
Hopefully you can help, we may have uncovered an issue in the latest osu_bw test (v5.8). It seems to crash when given the arguments below, while v5.7 with the exact same arguments and communications stack works fine.
Command:
mpirun --mca mtl ofi -np 2 -H gpu01,gpu02 ./osu-micro-benchmarks-5.8/mpi/pt2pt/osu_bw --accelerator cuda M M
If we remove the “--accelerator cuda” argument, that seems to work.
Also, osu_latency and others appear to work without issue.
BackTrace:
(gdb) bt
#0 0x000014bae79c9c7a in __memmove_sse2_unaligned_erms () from /lib64/libc.so.6
#1 0x000014bae8d1a557 in ?? () from /lib64/libcuda.so.1
#2 0x000014bae8d1a5bc in ?? () from /lib64/libcuda.so.1
#3 0x000014bae8efd2e2 in ?? () from /lib64/libcuda.so.1
#4 0x000014bae8d1e851 in ?? () from /lib64/libcuda.so.1
#5 0x000014bae8d716cc in ?? () from /lib64/libcuda.so.1
#6 0x000014bae8f0fd47 in ?? () from /lib64/libcuda.so.1
#7 0x000014bae8d3280e in ?? () from /lib64/libcuda.so.1
#8 0x000014bae8d33514 in ?? () from /lib64/libcuda.so.1
#9 0x000014bae8f45c0f in ?? () from /lib64/libcuda.so.1
#10 0x000014bae8d83cd7 in cuMemsetD8_v2 () from /lib64/libcuda.so.1
#11 0x000014baea27f460 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#12 0x000014baea25b132 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#13 0x000014baea29c88e in cudaMemset () from /usr/local/cuda/lib64/libcudart.so.11.0
#14 0x00000000004068a3 in set_buffer_pt2pt (buffer=<optimized out>, rank=<optimized out>, type=<optimized out>, data=<optimized out>, size=<optimized out>)
at ../../util/osu_util_mpi.c:829
#15 0x00000000004028a5 in main (argc=<optimized out>, argv=<optimized out>) at osu_bw.c:136
We have reliably reproduced this on several systems with different CUDA versions and GPU hardware.
Regards,
Adam Goldman
HPC Fabric Software Engineer
Intel Corporation
adam.goldman at intel.com<mailto:adam.goldman at intel.com>