[mvapich-discuss] code fails running on GPUs attached to different sockets
Sebastiano Fabio Schifano
schifano at fe.infn.it
Mon Apr 18 02:24:25 EDT 2016
Hi,
we are experiencing some issues running our "Lattice Boltzmann" code
on a machine with
- 2 CPU sockets E5-2630-v3 (Haswell class)
- 4 K80 per socket
- 1 IB card per socket
At each iteration, the code updates the halo boundaries of the lattice
portion allocated on each GPU.
The issue we are facing is the following:
- running on two GPUs attached to the same CPU socket, the result is correct
- running on two GPUs each attached to a different CPU socket, the result
is wrong
However, if we set MV2_USE_SHARED_MEM=0, the result is correct in both cases.
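For reference, this is how we apply the workaround (a sketch; the launcher, host names, process count, and binary name are placeholders for our actual job line):

```shell
# Disable MVAPICH2's shared-memory channel for intra-node communication.
# Exporting the variable works with mpirun/mpiexec-style launchers.
export MV2_USE_SHARED_MEM=0

# With mpirun_rsh the parameter can also be passed inline
# (hypothetical hosts and binary, adjust to your setup):
mpirun_rsh -np 2 node0 node0 MV2_USE_SHARED_MEM=0 ./lbm_app
```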
Investigating the problem further, we found that it happens only for certain
halo sizes:
- for halo sizes 8192 and 16384 (in doubles) the code fails
- for halo sizes 2048, 4096, 9216, 10240, 12288 (in doubles) the result
is correct
With LY the size of the halo, the size of each MPI communication is
26*(LY+6)*8 bytes.
We are running CentOS 7.2 and MVAPICH2 2.2, and we are NOT using GDRCOPY.
Any idea how to investigate this problem further? Any suggestions are
welcome.
Best Regards
fabio
--
------------------------------------------------------------------
Schifano Sebastiano Fabio
Department of Mathematics and Computer Science - University of Ferrara
c/o Polo Scientifico e Tecnologico, Edificio B stanza 208
via Saragat 1, I-44122 Ferrara (Italy)
Tel: +39 0532 97 4614
-------------------------------------------------------------------