[mvapich-discuss] Hang in CH3 SMP Rendezvous protocol w/ CUDA w/o InfiniBand

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Jan 29 09:17:16 EST 2015


On Thu, Jan 22, 2015 at 05:46:19PM -0500, Paul Sathre wrote:
> Hi Khaled,
> 
> Thanks for the feedback. What additional information would be most
> useful? The full config.log or some subset? /proc/cpuinfo? Something
> else?
> 
> I've dug a little deeper and tried two other non-InfiniBand systems I
> have access to, both of which succeed. (I used a modified configure
> line pointing to a userspace build of libibverbs.so v1.1.8-1 from the
> Debian repos and a non-standard CUDA 6.0 path:
> 
> ../mvapich2-2.1rc1/configure \
>     --prefix=/home/psath/mvapich2-2.1rc1/build/install \
>     --enable-cuda --disable-mcast \
>     --with-ib-libpath=/home/psath/libibverbs/install/lib \
>     --with-ib-include=/home/psath/libibverbs/install/include \
>     --with-libcuda=/usr/local/cuda-6.0/lib64 \
>     --with-libcudart=/usr/local/cuda-6.0/lib64/
> )
> 
> One successful system has dual K20Xm GPUs running NVIDIA driver 331.67.
> 
> The other has a single C2070 running the same driver.
> 
> The hanging system has 4x Tesla C2070s running NVIDIA driver 319.32
> and libibverbs 1.1.6. (I have tried swapping in libibverbs 1.1.8 and
> gcc 4.8 to bring it closer to the successful systems, to no avail, and
> a vimdiff of the failing system's config.log against either succeeding
> system's shows no significant differences.)
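> 
> (For reference, a single-node run along these lines is what exercises
> the CH3 SMP rendezvous path; the benchmark binary here is illustrative,
> and MV2_USE_CUDA=1 enables device-buffer transfers at runtime:
> 
>     $HOME/mvapich2-2.1rc1/build/install/bin/mpirun_rsh -np 2 \
>         localhost localhost MV2_USE_CUDA=1 ./osu_latency D D
> 
> Message sizes above the eager threshold between the two on-node ranks
> are what trigger the rendezvous protocol.)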

Hi Paul.  Thanks for pointing out the NVIDIA driver version, and sorry
that I didn't catch this earlier.  You'll need to update the driver to
331.20 or later to get things working.  Please let us know if you have
any other questions or run into further issues.
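
For reference, a quick way to confirm the installed kernel driver
version (assuming the NVIDIA kernel module is loaded):

    cat /proc/driver/nvidia/version

The banner printed by nvidia-smi reports the same version.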

-- 
Jonathan Perkins

