[mvapich-discuss] code fails running on GPUs attached to different sockets

Sebastiano Fabio Schifano schifano at fe.infn.it
Mon Apr 18 10:16:20 EDT 2016


Hi Hari,

No, we are using the latest MVAPICH2-GDR version 2.2b. Any other 
ideas or suggestions?

Regards,
fabio

On 04/18/2016 02:58 PM, Hari Subramoni wrote:
>
> Hello Fabio,
>
> Thanks for your note. It looks like you are using the regular MVAPICH2 
> version on your GPU system. Many issues related to GPUs and their 
> various code paths (including those involving QPI) have been resolved 
> in the MVAPICH2-GDR version, which also has many other features for 
> performance and scalability. We recommend this version for GPU 
> systems. Please use the latest MVAPICH2-GDR 2.2b release and let us 
> know if you see any issues.
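>
> One quick way to confirm which build is actually being picked up at run 
> time is the mpiname utility that ships with MVAPICH2 (assuming a 
> standard installation that is first in your PATH):
>
>     mpiname -a
>
> It prints the exact version string and the configure options of the 
> library in use.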
>
> Regards,
> Hari.
>
> On Apr 18, 2016 2:25 AM, "Sebastiano Fabio Schifano" 
> <schifano at fe.infn.it> wrote:
>
>     Hi,
>
>     we are experiencing some issues running our "Lattice Boltzmann"
>     code on a machine with:
>     - 2 CPU sockets with Intel Xeon E5-2630 v3 (Haswell class)
>     - 4 NVIDIA K80 GPUs per socket
>     - 1 InfiniBand card per socket
>
>     At each iteration, the code updates the halo boundaries of the
>     lattice portion allocated on each GPU.
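>
>     For illustration, a minimal sketch of what one such exchange looks
>     like with CUDA-aware MPI (the function and buffer names below are
>     simplified placeholders, not our exact code; the buffers are device
>     pointers passed directly to MPI):
>
>         #include <mpi.h>
>
>         /* Exchange one halo with the left/right neighbour ranks.
>            d_send and d_recv are cudaMalloc'd device buffers handed
>            straight to MPI, relying on the CUDA-aware code path. */
>         void exchange_halo(double *d_send, double *d_recv, int count,
>                            int left, int right, MPI_Comm comm)
>         {
>             MPI_Sendrecv(d_send, count, MPI_DOUBLE, right, 0,
>                          d_recv, count, MPI_DOUBLE, left,  0,
>                          comm, MPI_STATUS_IGNORE);
>         }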
>
>     The issue we are facing is the following:
>
>     - running on two GPUs attached to the same CPU socket, the result
>     is correct
>     - running on two GPUs attached to different CPU sockets, the
>     result is wrong
>
>     However, if we set MV2_USE_SHARED_MEM=0, the result is correct in
>     both cases.
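>
>     For reference, this is how we toggle it at launch (assuming the
>     mpirun_rsh launcher; the hostname and the binary name ./lbm are
>     placeholders, with both ranks on the same node but using GPUs on
>     different sockets):
>
>         mpirun_rsh -np 2 node01 node01 ./lbm
>         mpirun_rsh -np 2 node01 node01 MV2_USE_SHARED_MEM=0 ./lbm
>
>     The first command gives the wrong result, the second the correct
>     one.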
>
>     Investigating the problem further, we found that it happens only
>     for certain halo sizes:
>
>     - for halo sizes of 8192 and 16384 (double values) the code fails
>     - for halo sizes of 2048, 4096, 9216, 10240 and 12288 (double
>     values) the result is correct
>
>     Denoting by LY the halo size, the size of the MPI communication is
>     26*(LY+6)*8 bytes.
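>
>     As a worked check of that formula (only the small helper program
>     itself is illustrative; the constants are exactly the ones above),
>     the two failing halo sizes correspond to:
>
>         #include <stdio.h>
>
>         int main(void)
>         {
>             /* message size in bytes: 26 * (LY + 6) * sizeof(double) */
>             long ly[] = {8192, 16384};   /* the failing halo sizes */
>             for (int i = 0; i < 2; i++)
>                 printf("LY=%ld -> %ld bytes\n", ly[i], 26L * (ly[i] + 6) * 8);
>             return 0;   /* prints 1705184 and 3409120 */
>         }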
>
>     We are running CentOS 7.2 and MVAPICH2-2.2, and we are NOT using
>     GDRCOPY.
>
>     Any idea how to investigate this problem further? Any suggestion
>     is welcome.
>
>     Best Regards
>     fabio
>
>
>     -- 
>     ------------------------------------------------------------------
>       Schifano Sebastiano Fabio
>       Department of Mathematics and Computer Science - University of
>     Ferrara
>       c/o Polo Scientifico e Tecnologico, Edificio B stanza 208
>       via Saragat 1, I-44122 Ferrara (Italy)
>       Tel: +39 0532 97 4614
>     -------------------------------------------------------------------


-- 
  ------------------------------------------------------------------
  Schifano Sebastiano Fabio
  Department of Mathematics and Computer Science - University of Ferrara
  c/o Polo Scientifico e Tecnologico, Edificio B stanza 208
  via Saragat 1, I-44122 Ferrara (Italy)
  Tel: +39 0532 97 4614
  -------------------------------------------------------------------
