[mvapich-discuss] problem with mvapich2.ofa + CUDA

Brian Budge brian.budge at gmail.com
Tue Mar 11 16:38:13 EDT 2008


Hi all -

I have an application that uses MPI over InfiniBand and also uses
CUDA on NVIDIA graphics cards.  The program can be configured with or
without MPI (without MPI it is limited to a single node), and it can
run with or without GPUs.

The problem appears only when I am using MPI over IB together with
GPUs: essentially, allocation of GPU memory fails in this
configuration.  If I use mvapich2 over TCP the problem doesn't show
up (even though I am running IP over IB).  Likewise, if I don't use
GPUs, the program works fine.

A bit more detail about the failing allocation: the function is
cudaMalloc(), from the CUDA runtime library.  The calls actually
succeed up to a certain stage in my program, which happens to be
after several dynamic libraries have been opened via dlopen() and
after a thread has been spawned.
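
Roughly, the failing call looks like the following (simplified; the
buffer size and variable names here are just for illustration, not my
actual code):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        float *devBuf = NULL;
        size_t bytes = 64 * 1024 * 1024;   /* illustrative size only */
        cudaError_t err = cudaMalloc((void **)&devBuf, bytes);
        if (err != cudaSuccess) {
            /* this is the branch I end up in once MPI-over-IB and
               GPUs are combined */
            fprintf(stderr, "cudaMalloc failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        cudaFree(devBuf);
        return 0;
    }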

I am running mvapich2 in multithreaded mode, calling MPI_Init_thread()
soon after main() is entered.
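
For reference, the init sequence is basically this (a simplified
sketch; the thread level I request and check here is an assumption,
not copied from my real code):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided = 0;
        /* request full multithreading support and check what the
           library actually grants */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            fprintf(stderr, "only thread level %d provided\n", provided);

        /* ... dlopen of plugins, worker threads, CUDA work ... */

        MPI_Finalize();
        return 0;
    }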

I have tried some fairly minimal reproduction cases, and I can't seem
to make them fail; I may have to try something a bit more
complicated.  In the meantime, can anyone suggest what might be
broken?  Perhaps I've misconfigured mvapich2 with IB?
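
For what it's worth, the kind of minimal test I've tried looks
roughly like this (a sketch, not the exact test; libm.so.6 is just a
stand-in for the plugins the real application dlopens), and unlike
the real application it runs fine:

    #include <stdio.h>
    #include <dlfcn.h>
    #include <pthread.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    /* worker thread: just tries a device allocation, mimicking where
       the real application fails */
    static void *worker(void *arg)
    {
        void *d = NULL;
        cudaError_t err = cudaMalloc(&d, 16 * 1024 * 1024);
        printf("cudaMalloc in thread: %s\n", cudaGetErrorString(err));
        if (err == cudaSuccess)
            cudaFree(d);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        /* mimic the real app: load a shared library at runtime */
        void *h = dlopen("libm.so.6", RTLD_NOW | RTLD_GLOBAL);

        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);

        if (h)
            dlclose(h);
        MPI_Finalize();
        return 0;
    }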

I'm running Gentoo Linux with kernel version 2.6.24, and InfiniBand
support is built into the kernel along with the Mellanox drivers (I
have InfiniHost cards).

Thanks for any suggestions,
  Brian

