[Mvapich-discuss] Issue with host to device errors
Goldman, Adam
adam.goldman at intel.com
Mon Jun 2 16:11:24 EDT 2025
Hello,
When running "osu_bw H D" from a node without a GPU to a node with a GPU, I got CUDA device failures on the "H" node:
"no CUDA-capable device is detected"
This appears to be caused by the benchmarks calling init_accel()/cleanup_accel() even when the local rank will not be using an accelerator/GPU.
I hacked the code a bit to fix this in an older version (I am not sure how to do the same on newer versions):
1. Move init_accel() to after the rank is valid, i.e. after the call to "MPI_Comm_rank(MPI_COMM_WORLD, &myid)".
2. Update the init_accel()/cleanup_accel() calls to take the rank as an argument.
* int init_accel (int rank) {}
3. Add checks to the *_accel(int rank) functions so they skip the work when the rank is not using a GPU.
* if (rank == 0 && 'H' == options.src) // skip accel init
* if (rank == 1 && 'H' == options.dst) // skip accel init
* Not sure how to handle collectives
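
To make the three steps concrete, here is a rough sketch of what I mean, not the actual benchmark code. The "bench_options" struct, the "rank_uses_accel()" helper, the "accel_init_count" counter and the stubbed device calls are hypothetical names made up for illustration; only init_accel()/cleanup_accel(), options.src/options.dst and the two rank checks come from the description above. It covers only the two-rank point-to-point case and says nothing about collectives.

```c
#include <stdio.h>

/* Hypothetical stand-in for the benchmark options; only the two
 * buffer-location fields used in the checks above are modelled. */
struct bench_options {
    char src; /* buffer type on rank 0: 'H' = host, 'D' = device */
    char dst; /* buffer type on rank 1: 'H' = host, 'D' = device */
};

static struct bench_options options = { 'H', 'D' };

/* Counts how often the (stubbed) device setup ran; illustration only. */
static int accel_init_count = 0;

/* Hypothetical helper: returns 0 when this rank only touches host
 * memory, 1 otherwise. Assumes rank 0 is the sender and rank 1 the
 * receiver, as in a two-rank point-to-point test. */
static int rank_uses_accel(int rank)
{
    if (rank == 0 && 'H' == options.src) return 0; /* skip accel init */
    if (rank == 1 && 'H' == options.dst) return 0; /* skip accel init */
    return 1;
}

/* Stub standing in for the real device setup, which is not shown. */
static int device_setup_stub(void)
{
    accel_init_count++;
    return 0;
}

/* Rank-aware init: call this only after MPI_Comm_rank() has filled
 * in the rank. Returns 0 on success or when there is nothing to do. */
static int init_accel(int rank)
{
    if (!rank_uses_accel(rank)) {
        return 0; /* host-only rank: never touch the device runtime */
    }
    return device_setup_stub();
}

/* Rank-aware cleanup, mirroring init_accel(). */
static int cleanup_accel(int rank)
{
    if (!rank_uses_accel(rank)) {
        return 0; /* nothing was initialised on this rank */
    }
    printf("rank %d: device cleanup would run here\n", rank);
    return 0;
}
```

With options.src = 'H' and options.dst = 'D', init_accel(0) returns without touching the device, while init_accel(1) goes through the device setup path.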
Regards,
Adam
Adam Goldman
High Performance Networking
Intel Corporation
adam.goldman at intel.com