[Mvapich-discuss] Possible bug in OSU Micro-Benchmarks 5.7.1 with cuda

Goldman, Adam adam.goldman at intel.com
Wed Jun 30 13:07:32 EDT 2021


osu_latency looks good.

osu_bw does not:
CMD: ./mpi/pt2pt/osu_bw --accelerator cuda MH MH

*** Error in `./mpi/pt2pt/osu_bw': free(): invalid next size (fast): 0x00000000024e2a50 ***
[hdsmgpu02:02870] *** Process received signal ***
[hdsmgpu02:02870] Signal: Aborted (6)
[hdsmgpu02:02870] Signal code:  (-6)
[hdsmgpu02:02870] [ 0] /lib64/libpthread.so.0(+0x132d0)[0x14953c7462d0]
[hdsmgpu02:02870] [ 1] /lib64/libc.so.6(gsignal+0x110)[0x14953c3b1520]
[hdsmgpu02:02870] [ 2] /lib64/libc.so.6(abort+0x151)[0x14953c3b2b01]
[hdsmgpu02:02870] [ 3] /lib64/libc.so.6(+0x7c957)[0x14953c3f4957]
[hdsmgpu02:02870] [ 4] /lib64/libc.so.6(+0x83173)[0x14953c3fb173]
[hdsmgpu02:02870] [ 5] /lib64/libc.so.6(+0x84a79)[0x14953c3fca79]
[hdsmgpu02:02870] [ 6] ./mpi/pt2pt/osu_bw[0x402aab]
[hdsmgpu02:02870] [ 7] /lib64/libc.so.6(__libc_start_main+0xea)[0x14953c39c34a]
[hdsmgpu02:02870] [ 8] ./mpi/pt2pt/osu_bw[0x402c4a]
[hdsmgpu02:02870] *** End of error message ***
*** Error in `./mpi/pt2pt/osu_bw': free(): invalid next size (fast): 0x0000000001dd0a50 ***
[hdsmgpu01:02649] *** Process received signal ***
[hdsmgpu01:02649] Signal: Aborted (6)
[hdsmgpu01:02649] Signal code:  (-6)
[hdsmgpu01:02649] [ 0] /lib64/libpthread.so.0(+0x132d0)[0x14ff0d3362d0]
[hdsmgpu01:02649] [ 1] /lib64/libc.so.6(gsignal+0x110)[0x14ff0cfa1520]
[hdsmgpu01:02649] [ 2] /lib64/libc.so.6(abort+0x151)[0x14ff0cfa2b01]
[hdsmgpu01:02649] [ 3] /lib64/libc.so.6(+0x7c957)[0x14ff0cfe4957]
[hdsmgpu01:02649] [ 4] /lib64/libc.so.6(+0x83173)[0x14ff0cfeb173]
[hdsmgpu01:02649] [ 5] /lib64/libc.so.6(+0x84a79)[0x14ff0cfeca79]
[hdsmgpu01:02649] [ 6] ./mpi/pt2pt/osu_bw[0x402aab]
[hdsmgpu01:02649] [ 7] /lib64/libc.so.6(__libc_start_main+0xea)[0x14ff0cf8c34a]
[hdsmgpu01:02649] [ 8] ./mpi/pt2pt/osu_bw[0x402c4a]
[hdsmgpu01:02649] *** End of error message ***


> -----Original Message-----
> From: Subramoni, Hari <subramoni.1 at osu.edu>
> Sent: Wednesday, June 30, 2021 10:57 AM
> To: Goldman, Adam <adam.goldman at intel.com>
> Cc: Rimmer, Todd <todd.rimmer at intel.com>; mvapich-discuss at lists.osu.edu;
> Subramoni, Hari <subramoni.1 at osu.edu>
> Subject: RE: [Mvapich-discuss] Possible bug in OSU Micro-Benchmarks 5.7.1
> with cuda
> 
> Hi, Adam.
> 
> Thanks for getting back to us. Glad to hear that it works as expected now.
> We have updated the patch for the other point-to-point benchmarks and have
> attached it here.
> 
> This will be available with the next release of OMB.
> 
> Best,
> Hari.
> 
> -----Original Message-----
> From: Goldman, Adam <adam.goldman at intel.com>
> Sent: Tuesday, June 29, 2021 9:47 AM
> To: Subramoni, Hari <subramoni.1 at osu.edu>
> Cc: Rimmer, Todd <todd.rimmer at intel.com>; mvapich-discuss at lists.osu.edu
> Subject: RE: [Mvapich-discuss] Possible bug in OSU Micro-Benchmarks 5.7.1
> with cuda
> 
> Thank you for the quick response. The last patch appears to work. I tested
> with all combos of 'H' 'D' 'MH' 'MD'.
> 
> -----Original Message-----
> From: Subramoni, Hari <subramoni.1 at osu.edu>
> Sent: Monday, June 28, 2021 5:00 PM
> To: Goldman, Adam <adam.goldman at intel.com>
> Cc: Rimmer, Todd <todd.rimmer at intel.com>; mvapich-discuss at lists.osu.edu;
> Subramoni, Hari <subramoni.1 at osu.edu>
> Subject: RE: [Mvapich-discuss] Possible bug in OSU Micro-Benchmarks 5.7.1
> with cuda
> 
> Please use this version. I sent the wrong version with the last e-mail. I
> will check for correctness and commit an appropriate patch. This will be
> available with our next release with an acknowledgement to you.
> 
> Thx,
> Hari.
> 
> -----Original Message-----
> From: Subramoni, Hari <subramoni.1 at osu.edu>
> Sent: Monday, June 28, 2021 4:49 PM
> To: Goldman, Adam <adam.goldman at intel.com>
> Cc: Rimmer, Todd <todd.rimmer at intel.com>; mvapich-discuss at lists.osu.edu;
> Subramoni, Hari <subramoni.1 at osu.edu>
> Subject: RE: [Mvapich-discuss] Possible bug in OSU Micro-Benchmarks 5.7.1
> with cuda
> 
> Hi, Adam.
> 
> Thanks a lot for identifying this and providing the patch. I've created a
> slightly modified version of the patch. Could you please let me know if
> this works as expected for you?
> 
> I will make similar changes for other benchmarks as well.
> 
> Thx,
> Hari.
> 
> -----Original Message-----
> From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> On Behalf Of
> Goldman, Adam via Mvapich-discuss
> Sent: Monday, June 28, 2021 2:57 PM
> To: mvapich-discuss at lists.osu.edu
> Cc: Rimmer, Todd <todd.rimmer at intel.com>
> Subject: [Mvapich-discuss] Possible bug in OSU Micro-Benchmarks 5.7.1 with
> cuda
> 
> Hello,
> 
> While running the latest osu_latency (v5.7.1) benchmark with CUDA support
> enabled, we encountered a possible bug in the OSU benchmark code.
> It appears that when running osu_latency with one side using "MH" and the
> other "H", the non-CUDA-managed side will attempt CUDA calls above
> message size 131072.
> 
> I am using Open MPI v4.1.1 compiled with CUDA support on RHEL 8.1.
> 
> # mpirun -np 2 --host host1,host2 ./mpi/pt2pt/osu_latency -m 131072: MH H
> ...
> 131072                512.90
> [../../util/osu_util_mpi.c:1691] CUDA call 'cudaMemPrefetchAsync(buf,
> length, devid, um_stream)' failed with 1: invalid argument
> 
> From some debugging, it appears to be passing a pointer to memory
> allocated without CUDA calls on the node that is not using CUDA.
> This issue appears to be new in v5.7.1.
> 
> Not sure if this is the fix, but this seemed to fix the issue on
> osu_latency.c:
> ================================
> @@ -134,9 +134,9 @@
> 
>          for(i = 0; i < options.iterations + options.skip; i++) {
>  #ifdef _ENABLE_CUDA_
> -            if (options.src == 'M') {
> +            if (myid == 0) {
>                  touch_managed_src(s_buf, size);
> -            } else if (options.dst == 'M') {
> +            } else {
>                  touch_managed_dst(s_buf, size);
>              }
>  #endif
> @@ -149,8 +149,6 @@
>                  MPI_CHECK(MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD));
>                  MPI_CHECK(MPI_Recv(r_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &reqstat));
>  #ifdef _ENABLE_CUDA_
> -                if (options.src == 'M') {
> -                    touch_managed_src(r_buf, size);
> -                }
> +                touch_managed_src(r_buf, size);
>  #endif
> @@ -161,9 +161,7 @@
>              } else if (myid == 1) {
>                  MPI_CHECK(MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &reqstat));
>  #ifdef _ENABLE_CUDA_
> -                if (options.dst == 'M') {
> -                    touch_managed_dst(r_buf, size);
> -                }
> +                touch_managed_dst(r_buf, size);
>  #endif
> 
>                  MPI_CHECK(MPI_Send(s_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD));
> ================================
> 
> Only the first change is strictly required; the last two are cleanups that
> avoid evaluating the same if expression twice, once outside the
> touch_managed_* functions and once inside them.
> 
> Thank you,
> 
> Adam Goldman
> HPC Fabric Software Engineer
> Intel Corporation
> adam.goldman at intel.com
> 
> _______________________________________________
> Mvapich-discuss mailing list
> Mvapich-discuss at lists.osu.edu
> https://lists.osu.edu/mailman/listinfo/mvapich-discuss
