[mvapich-discuss] code fails running on GPUs attached to different sockets

Sebastiano Fabio Schifano schifano at fe.infn.it
Mon Apr 18 02:24:25 EDT 2016


Hi,

we are experiencing some issues running our "Lattice Boltzmann" code
on a machine with
- 2 CPU sockets, Intel Xeon E5-2630 v3 (Haswell class)
- 4 K80 boards per socket
- 1 InfiniBand card per socket

At each iteration the code updates the halo boundaries of the lattice
portion allocated on each GPU.
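
For reference, the exchange is essentially the following simplified
sketch (buffer and function names here are placeholders, assuming the
device pointers are passed directly to CUDA-aware MPI):

  /* Sketch of the per-iteration halo exchange; actual code differs.
     The d_* buffers are device pointers handled by CUDA-aware MPI. */
  #include <mpi.h>

  void exchange_halos(double *d_send_lo, double *d_send_hi,
                      double *d_recv_lo, double *d_recv_hi,
                      int count,            /* 26*(LY+6) doubles */
                      int lo_rank, int hi_rank, MPI_Comm comm)
  {
      MPI_Request reqs[4];
      MPI_Irecv(d_recv_lo, count, MPI_DOUBLE, lo_rank, 0, comm, &reqs[0]);
      MPI_Irecv(d_recv_hi, count, MPI_DOUBLE, hi_rank, 1, comm, &reqs[1]);
      MPI_Isend(d_send_hi, count, MPI_DOUBLE, hi_rank, 0, comm, &reqs[2]);
      MPI_Isend(d_send_lo, count, MPI_DOUBLE, lo_rank, 1, comm, &reqs[3]);
      MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
  }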

The issue we are facing is the following:

- running on two GPUs attached to the same CPU socket, the result is
correct
- running on two GPUs each attached to a different CPU socket, the
result is wrong

However, if we set MV2_USE_SHARED_MEM=0, the result is correct in both cases.
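
For completeness, the working invocation looks like this (hostname and
binary name are placeholders):

  mpirun_rsh -np 2 node0 node0 MV2_USE_CUDA=1 MV2_USE_SHARED_MEM=0 ./lbm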

Investigating the problem further, we found that it happens only for
certain halo sizes:

- for halo sizes 8192 and 16384 (double values) the code fails
- for halo sizes 2048, 4096, 9216, 10240, 12288 (double values) the
result is correct

Denoting by LY the halo size, the size of the MPI communication is
26*(LY+6)*8 bytes.
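
For example, the two failing sizes correspond to messages of

  LY = 8192  -> 26*(8192+6)*8  = 1,705,184 bytes (~1.63 MB)
  LY = 16384 -> 26*(16384+6)*8 = 3,409,120 bytes (~3.25 MB)

while a working size such as LY = 9216 gives 26*(9216+6)*8 = 1,918,176
bytes, which lies between the two failing message sizes, so it does not
look like a simple message-size threshold.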

We are running CentOS 7.2 with MVAPICH2 2.2, and we are NOT using GDRCOPY.

Any idea how to investigate this problem further? Any suggestion is
welcome.

Best Regards
fabio


-- 
   ------------------------------------------------------------------
   Schifano Sebastiano Fabio
   Department of Mathematics and Computer Science - University of Ferrara
   c/o Polo Scientifico e Tecnologico, Edificio B stanza 208
   via Saragat 1, I-44122 Ferrara (Italy)
   Tel: +39 0532 97 4614
   -------------------------------------------------------------------


