[Mvapich-discuss] Issue with host to device errors

Goldman, Adam adam.goldman at intel.com
Mon Jun 2 16:11:24 EDT 2025


Hello,

When running "osu_bw H D" from a node without a GPU to a node with a GPU, I got CUDA device failures on the "H" node:
"no CUDA-capable device is detected"

This appears to be caused by the benchmarks calling init_accel()/cleanup_accel() even when the local rank will not be using an accelerator/GPU.

I hacked the code a bit to fix this in an older version (I am not sure how to do this on newer versions):

  1.  Move init_accel() to after the rank is valid, i.e. after the call to "MPI_Comm_rank(MPI_COMM_WORLD, &myid)".
  2.  Update the init_accel()/cleanup_accel() calls to take the rank as an argument.
     *   int init_accel (int rank) {}
  3.  Add checks to the *_accel(int rank) funcs to skip the work when the rank is not using a GPU.
     *   if (rank == 0 && 'H' == options.src) // skip accel init
     *   if (rank == 1 && 'H' == options.dst) // skip accel init
     *   Not sure how to handle collectives
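
The checks in step 3 could be sketched roughly as below. This is only an illustration of the idea, not the actual benchmark code: the struct bench_options type and the names rank_needs_accel() and init_accel_for_rank() are hypothetical stand-ins, and the real device initialization is left as a comment. Ranks other than 0 and 1 (for example in collectives) are left unchanged, since that case is still open.

```c
/* Hypothetical stand-in for the benchmark's options; only the two
 * buffer-placement fields mentioned above are modeled. */
struct bench_options {
    char src; /* buffer type on rank 0: 'H' = host, 'D' = device */
    char dst; /* buffer type on rank 1: 'H' = host, 'D' = device */
};

/* Returns 0 if this rank only touches host memory, 1 otherwise. */
int rank_needs_accel(const struct bench_options *opts, int rank)
{
    if (rank == 0 && opts->src == 'H') {
        return 0; /* sender uses a host buffer: skip accel init */
    }
    if (rank == 1 && opts->dst == 'H') {
        return 0; /* receiver uses a host buffer: skip accel init */
    }
    /* Conservative default: any other rank keeps the old behavior. */
    return 1;
}

/* Hypothetical rank-aware wrapper. Returns 1 if the accelerator would
 * be initialized for this rank, 0 if initialization is skipped. */
int init_accel_for_rank(const struct bench_options *opts, int rank)
{
    if (!rank_needs_accel(opts, rank)) {
        return 0;
    }
    /* The existing device setup would go here. */
    return 1;
}
```

With "osu_bw H D", rank 0 would skip the accelerator setup and rank 1 would still do it. A cleanup_accel() variant would need the same check so that it only tears down what was actually initialized.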

Regards,
Adam

Adam Goldman
High Performance Networking
Intel Corporation
adam.goldman at intel.com

