[mvapich-discuss] GDRCopy Version / Segfault

Chu, Ching-Hsiang chu.368 at buckeyemail.osu.edu
Mon Nov 11 18:02:03 EST 2019


Hi Christopher,

Thanks for trying out MVAPICH2-GDR on the Summit TDS system.

The current release, MVAPICH2-GDR 2.3.2, supports and is tested against GDRCopy v1.x only. The upcoming release of MVAPICH2-GDR will support GDRCopy v2.0.
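
For context, one concrete difference between the two header generations is the gdr_mh_t handle type. The sketch below reflects my reading of the gdrapi.h headers (an assumption on my part, not anything the MVAPICH2 user guide documents); compiling it once against the v1.x header and once against v2.0 shows the kind of ABI change that can make a v1.x-built libmpi.so fault inside gdr_map():

/* size_check.c: report the size of the GDRCopy handle type.
 * Build against each header generation to compare:
 *   cc size_check.c -I<gdrcopy_install>/include -o size_check
 * (<gdrcopy_install> is a placeholder for your install prefix.)
 */
#include <stdio.h>
#include <gdrapi.h>

int main(void)
{
    /* In the v1.x headers gdr_mh_t is a 32-bit integer handle; in
     * v2.0 it is a struct wrapping an unsigned long. An MPI library
     * built against one header but loaded against the other
     * libgdrapi.so passes the handle to gdr_map() with the wrong
     * ABI. */
    printf("sizeof(gdr_mh_t) = %zu\n", sizeof(gdr_mh_t));
    return 0;
}

Assuming those typedefs, the v1.x build prints 4 and the v2.0 build prints 8 on a 64-bit system, which is exactly the kind of silent mismatch that surfaces as a segfault rather than a link error.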

Thanks,

Ching-Hsiang Chu

________________________________
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Zimmer, Christopher <zimmercj at ornl.gov>
Sent: Monday, November 11, 2019 5:17 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] GDRCopy Version / Segfault

We are standing up MVAPICH2-GDR with GDRCopy on Peak (the Summit TDS system), and I wanted to verify which version of GDRCopy it is tested against.

The GDRCopy link on http://mvapich.cse.ohio-state.edu/userguide/gdr/ redirects me to mellanox.com.


Currently running:
[cjzimmer at h41n10 ~]$ modinfo gdrdrv
filename:       /lib/modules/4.14.0-115.8.1.el7a.ppc64le/kernel/drivers/misc/gdrdrv.ko
version:        2.0
description:    GDRCopy kernel-mode driver
license:        MIT
author:         drossetti at nvidia.com
rhelversion:    7.6
srcversion:     D6AAA99E1E64DADB7B7F3AA
depends:        nv-p2p-dummy
name:           gdrdrv
vermagic:       4.14.0-115.8.1.el7a.ppc64le SMP mod_unload modversions mprofile-kernel
parm:           dbg_enabled:enable debug tracing (int)
parm:           info_enabled:enable info tracing (int)

The copybw sanity tool works.


When running osu_bw, I see a core dump in the gdr_map function. Stack trace:
#0  0x0000200009c517d0 in gdr_map () from /gpfs/alpine/stf008/scratch/cjzimmer/MPI_Build/gdrcopy_install/lib64/libgdrapi.so
#1  0x00002000008fe93c in cuda_ptrcache_insert () from /gpfs/alpine/stf008/scratch/cjzimmer/MPI_Build/mvapich_install/lib64/libmpi.so
#2  0x000020000092d590 in my_cuMemAlloc () from /gpfs/alpine/stf008/scratch/cjzimmer/MPI_Build/mvapich_install/lib64/libmpi.so
#3  0x000020000092d4d8 in cuMemAlloc_v2 () from /gpfs/alpine/stf008/scratch/cjzimmer/MPI_Build/mvapich_install/lib64/libmpi.so
#4  0x0000200001cd2b90 in ?? () from /sw/peak/cuda/10.1.105/lib64/libcudart.so.10.1
#5  0x0000200001ca154c in ?? () from /sw/peak/cuda/10.1.105/lib64/libcudart.so.10.1
#6  0x0000200001ce21d0 in cudaMalloc () from /sw/peak/cuda/10.1.105/lib64/libcudart.so.10.1
#7  0x0000000010006a48 in allocate_device_buffer (buffer=0x20000a5b8f30) at ../../util/osu_util_mpi.c:806
#8  0x000000001000711c in allocate_memory_pt2pt (sbuf=0x7fffdbe94a18, rbuf=0x7fffdbe94a10, rank=<optimized out>) at ../../util/osu_util_mpi.c:988
#9  0x0000000010002174 in main (argc=<optimized out>, argv=<optimized out>) at osu_bw.c:91
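
For reference, below is a minimal standalone sketch that exercises the same cuMemAlloc -> gdr_pin_buffer -> gdr_map path as the trace above, using only documented GDRCopy and CUDA driver API calls (device 0, a one-page allocation, and the build line are assumptions; adjust to your install):

/* gdr_map_repro.c: pin and map one GPU page via GDRCopy, outside MPI.
 *   cc gdr_map_repro.c -I<gdrcopy_install>/include \
 *      -L<gdrcopy_install>/lib64 -lgdrapi -lcuda -o gdr_map_repro
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <gdrapi.h>

#define CHECK(x) do { if (!(x)) { fprintf(stderr, "FAILED: %s\n", #x); exit(1); } } while (0)

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr d_ptr;

    CHECK(cuInit(0) == CUDA_SUCCESS);
    CHECK(cuDeviceGet(&dev, 0) == CUDA_SUCCESS);
    CHECK(cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS);
    /* One GPU page (64 KiB, per gdrapi.h); cuMemAlloc returns memory
     * suitably aligned for pinning at this size in practice. */
    CHECK(cuMemAlloc(&d_ptr, GPU_PAGE_SIZE) == CUDA_SUCCESS);

    gdr_t g = gdr_open();
    CHECK(g != NULL);

    gdr_mh_t mh;
    void *map_ptr = NULL;
    CHECK(gdr_pin_buffer(g, (unsigned long)d_ptr, GPU_PAGE_SIZE, 0, 0, &mh) == 0);
    /* gdr_map() is the frame that faults in the trace above. */
    CHECK(gdr_map(g, mh, &map_ptr, GPU_PAGE_SIZE) == 0);
    printf("gdr_map OK: BAR1 mapping at %p\n", map_ptr);

    gdr_unmap(g, mh, map_ptr, GPU_PAGE_SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_ptr);
    return 0;
}

If this succeeds (as copybw already does) while osu_bw still faults, that points at a GDRCopy version mismatch in the MVAPICH2-GDR build rather than at the GDRCopy installation itself.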

