[mvapich-discuss] code fails running on GPUs attached to different sockets

Sebastiano Fabio Schifano schifano at fe.infn.it
Mon Apr 18 11:20:55 EDT 2016


Hi Khaled,

With MV2_USE_SHARED_MEM left at its default value:

- setting MV2_USE_GPUDIRECT_LOOPBACK=0, the code gives the correct result;
- setting MV2_CUDA_NONBLOCKING_STREAMS=0, the code gives a wrong result.
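
As a sanity check on which combination is actually active in a given run,
here is a minimal sketch (not part of our application) that prints, on each
rank, the MVAPICH2 runtime parameters discussed in this thread:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Return the value of an environment variable, or "(default)" if unset. */
const char *show(const char *name)
{
    const char *v = getenv(name);
    return v ? v : "(default)";
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MV2_USE_SHARED_MEM=%s MV2_USE_GPUDIRECT_LOOPBACK=%s "
           "MV2_CUDA_NONBLOCKING_STREAMS=%s\n", rank,
           show("MV2_USE_SHARED_MEM"),
           show("MV2_USE_GPUDIRECT_LOOPBACK"),
           show("MV2_CUDA_NONBLOCKING_STREAMS"));
    MPI_Finalize();
    return 0;
}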

Thanks
fabio



On 04/18/2016 04:22 PM, khaled hamidouche wrote:
> Hi Sebastiano,
>
> Can you please try these configurations one by one and see if they help
> fix the issue:
>
> 1) - MV2_USE_GPUDIRECT_LOOPBACK=0
> 2) - MV2_CUDA_NONBLOCKING_STREAMS=0
>
> Thanks
>
> On Mon, Apr 18, 2016 at 10:16 AM, Sebastiano Fabio Schifano 
> <schifano at fe.infn.it> wrote:
>
>     Hi Hari,
>
>     No, we are already using the latest MVAPICH2-GDR version 2.2b. Any
>     other ideas or suggestions?
>
>     Regards,
>     fabio
>
>
>     On 04/18/2016 02:58 PM, Hari Subramoni wrote:
>>
>>     Hello Fabio,
>>
>>     Thanks for your note. It looks like you are using the regular
>>     MVAPICH2 version on your GPU system. Many issues related to GPUs
>>     and their various code paths (including those crossing the QPI link
>>     between sockets) have been resolved in the MVAPICH2-GDR version,
>>     which also has many other performance and scalability features. We
>>     recommend MVAPICH2-GDR for GPU systems. Please use the latest
>>     MVAPICH2-GDR 2.2b release and let us know if you still see any
>>     issues.
>>
>>     Regards,
>>     Hari.
>>
>>     On Apr 18, 2016 2:25 AM, "Sebastiano Fabio Schifano"
>>     <schifano at fe.infn.it> wrote:
>>
>>         Hi,
>>
>>         we are experiencing some issues running our "Lattice
>>         Boltzmann" code on a machine with:
>>         - 2 CPU sockets, E5-2630 v3 (Haswell class)
>>         - 4 K80 GPUs per socket
>>         - 1 IB card per socket
>>
>>         At each iteration, the code updates the halo boundaries of the
>>         lattice portion allocated on each GPU.
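>>
>>         To give an idea of the communication pattern, here is a minimal
>>         sketch of this kind of CUDA-aware halo exchange, assuming one
>>         GPU per MPI rank and device-resident halo buffers; the names,
>>         tags and neighbour ranks are placeholders and not taken from
>>         the actual Lattice Boltzmann code:
>>
>>         #include <mpi.h>
>>         #include <cuda_runtime.h>
>>         #include <stdlib.h>
>>
>>         /* Pick one GPU per rank before MPI_Init. The local-rank
>>            variable MV2_COMM_WORLD_LOCAL_RANK is set by MVAPICH2's
>>            mpirun_rsh launcher (assumed to be the launcher here). */
>>         void select_device(void)
>>         {
>>             const char *lr = getenv("MV2_COMM_WORLD_LOCAL_RANK");
>>             cudaSetDevice(lr ? atoi(lr) : 0);
>>         }
>>
>>         /* One halo update: exchange device-resident halo buffers with
>>            the lower and upper neighbour ranks; CUDA-aware MPI accepts
>>            the device pointers directly. */
>>         void exchange_halos(double *d_send_lo, double *d_recv_lo,
>>                             double *d_send_hi, double *d_recv_hi,
>>                             int count, int nbr_lo, int nbr_hi,
>>                             MPI_Comm comm)
>>         {
>>             MPI_Request req[4];
>>             MPI_Irecv(d_recv_lo, count, MPI_DOUBLE, nbr_lo, 0, comm, &req[0]);
>>             MPI_Irecv(d_recv_hi, count, MPI_DOUBLE, nbr_hi, 1, comm, &req[1]);
>>             MPI_Isend(d_send_hi, count, MPI_DOUBLE, nbr_hi, 0, comm, &req[2]);
>>             MPI_Isend(d_send_lo, count, MPI_DOUBLE, nbr_lo, 1, comm, &req[3]);
>>             MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
>>         }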
>>
>>         The issue we are facing is the following:
>>
>>         - running on two GPUs attached to the same CPU socket, the
>>         result is correct;
>>         - running on two GPUs attached to two different CPU sockets,
>>         the result is wrong.
>>
>>         However, if we set MV2_USE_SHARED_MEM=0, the result is correct
>>         in both cases.
>>
>>         Investigating the problem further, we found that it happens
>>         only for certain halo sizes:
>>
>>         - for halo sizes 8192 and 16384 (double values) the code fails;
>>         - for halo sizes 2048, 4096, 9216, 10240 and 12288 (double
>>         values) the result is correct.
>>
>>         With LY being the halo size, the size of the MPI communication
>>         is 26*(LY+6)*8 bytes.
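>>
>>         As a rough worked example, this small snippet just evaluates
>>         the formula above for the halo sizes we tested:
>>
>>         #include <stdio.h>
>>
>>         int main(void)
>>         {
>>             /* Halo sizes reported above: the first two fail, the
>>                others give the correct result. */
>>             const long sizes[] = {8192, 16384, 2048, 4096, 9216, 10240,
>>                                   12288};
>>             for (int i = 0; i < 7; i++) {
>>                 /* MPI message size per the formula: 26*(LY+6)*8 bytes. */
>>                 printf("LY=%5ld -> %ld bytes\n", sizes[i],
>>                        26L * (sizes[i] + 6) * 8);
>>             }
>>             return 0;
>>         }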
>>
>>         We are running CentOS 7.2 with MVAPICH2 2.2, and we are NOT
>>         using GDRCOPY.
>>
>>         Any idea how to investigate this problem further? Any
>>         suggestion is welcome.
>>
>>         Best Regards
>>         fabio
>>
>>
>>
>
>
>
>


-- 
  ------------------------------------------------------------------
  Schifano Sebastiano Fabio
  Department of Mathematics and Computer Science - University of Ferrara
  c/o Polo Scientifico e Tecnologico, Edificio B stanza 208
  via Saragat 1, I-44122 Ferrara (Italy)
  Tel: +39 0532 97 4614
  -------------------------------------------------------------------


