[Mvapich-discuss] osu_bw segfault when running with CUDA accelerator and managed buffers

Shineman, Nat shineman.5 at osu.edu
Mon Nov 29 13:45:27 EST 2021


Adam,

Sorry for the delay here. After some internal experimentation, we discovered that OMB was not handling the --accelerator option correctly. This option should not be used with the pt2pt tests, since the two buffers are set with the H/D/MH/MD arguments. However, instead of being ignored as it should have been, it caused the benchmark to enter an alternate code path and break.

Please continue running your experiments without the --accelerator=cuda option and you should get the desired results. Our next release includes a fix that ignores this option for pt2pt tests, and we have added an acknowledgement to you in the Changelog.

As a side note, the M buffer option is deprecated. Please use either MH or MD (managed host and managed device, respectively) instead.
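
For illustration, the command from the original report with the --accelerator option dropped and the deprecated M arguments replaced might look like the sketch below. The thread does not say whether MH or MD corresponds to the original M, so using MD for both buffers is an assumption here; the hosts, MCA settings, and binary path are simply copied from the report.

```shell
# Sketch only: same launch line as in the report, minus "--accelerator cuda".
# The two trailing arguments select the two buffer types for the pt2pt test;
# "MD MD" (managed device) is an assumed stand-in for the deprecated "M M".
mpirun --mca mtl ofi -np 2 -H gpu01,gpu02 ./osu-micro-benchmarks-5.8/mpi/pt2pt/osu_bw MD MD
```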

Please let me know if you have any questions.

Thanks,
Nat
________________________________
From: Mvapich-discuss <mvapich-discuss-bounces+shineman.5=osu.edu at lists.osu.edu> on behalf of Goldman, Adam via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Friday, November 12, 2021 09:10
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Cc: DAmbrosio, Cody J <cody.j.dambrosio at intel.com>; Rimmer, Todd <todd.rimmer at intel.com>; Bodner, Anton <anton.bodner at intel.com>
Subject: Re: [Mvapich-discuss] osu_bw segfault when running with CUDA accelerator and managed buffers


Hi, sending this again.



We are still seeing issues with the latest osu_bw benchmarks.



-Adam



From: Mvapich-discuss <mvapich-discuss-bounces+adam.goldman=intel.com at lists.osu.edu> On Behalf Of Goldman, Adam via Mvapich-discuss
Sent: Tuesday, November 2, 2021 11:04 AM
To: mvapich-discuss at lists.osu.edu
Cc: DAmbrosio, Cody J <cody.j.dambrosio at intel.com>; Bodner, Anton <anton.bodner at intel.com>; Rimmer, Todd <todd.rimmer at intel.com>
Subject: [Mvapich-discuss] osu_bw segfault when running with CUDA accelerator and managed buffers



Hello,



Hopefully you can help. We may have uncovered an issue in the latest osu_bw test (v5.8). It seems to crash when given the arguments below, while v5.7 with the exact same arguments and communications stack works fine.



Command:

mpirun --mca mtl ofi -np 2 -H gpu01,gpu02 ./osu-micro-benchmarks-5.8/mpi/pt2pt/osu_bw --accelerator cuda M M



If we remove the “--accelerator cuda” argument, the test seems to work.

Also, osu_latency and others appear to work without issue.



BackTrace:

(gdb) bt
#0  0x000014bae79c9c7a in __memmove_sse2_unaligned_erms () from /lib64/libc.so.6
#1  0x000014bae8d1a557 in ?? () from /lib64/libcuda.so.1
#2  0x000014bae8d1a5bc in ?? () from /lib64/libcuda.so.1
#3  0x000014bae8efd2e2 in ?? () from /lib64/libcuda.so.1
#4  0x000014bae8d1e851 in ?? () from /lib64/libcuda.so.1
#5  0x000014bae8d716cc in ?? () from /lib64/libcuda.so.1
#6  0x000014bae8f0fd47 in ?? () from /lib64/libcuda.so.1
#7  0x000014bae8d3280e in ?? () from /lib64/libcuda.so.1
#8  0x000014bae8d33514 in ?? () from /lib64/libcuda.so.1
#9  0x000014bae8f45c0f in ?? () from /lib64/libcuda.so.1
#10 0x000014bae8d83cd7 in cuMemsetD8_v2 () from /lib64/libcuda.so.1
#11 0x000014baea27f460 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#12 0x000014baea25b132 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#13 0x000014baea29c88e in cudaMemset () from /usr/local/cuda/lib64/libcudart.so.11.0
#14 0x00000000004068a3 in set_buffer_pt2pt (buffer=<optimized out>, rank=<optimized out>, type=<optimized out>, data=<optimized out>, size=<optimized out>)
    at ../../util/osu_util_mpi.c:829
#15 0x00000000004028a5 in main (argc=<optimized out>, argv=<optimized out>) at osu_bw.c:136



We have reproduced this repeatably on several systems with different CUDA versions and GPU hardware.



Regards,



Adam Goldman

HPC Fabric Software Engineer

Intel Corporation

adam.goldman at intel.com

