[mvapich-discuss] code fails running on GPUs attached to different sockets
Sebastiano Fabio Schifano
schifano at fe.infn.it
Mon Apr 18 11:20:55 EDT 2016
Hi Khaled,
leaving MV2_USE_SHARED_MEM set to its default value:
- with MV2_USE_GPUDIRECT_LOOPBACK=0 the code gives the right result,
- while with MV2_CUDA_NONBLOCKING_STREAMS=0 the code still gives a wrong
result.
Thanks
fabio
On 04/18/2016 04:22 PM, khaled hamidouche wrote:
> Hi Sebastiano,
>
> Can you please try these configurations one by one and see if they help
> fix the issue:
>
> 1) - MV2_USE_GPUDIRECT_LOOPBACK=0
> 2) - MV2_CUDA_NONBLOCKING_STREAMS=0
>
> Thanks
>
> On Mon, Apr 18, 2016 at 10:16 AM, Sebastiano Fabio Schifano
> <schifano at fe.infn.it <mailto:schifano at fe.infn.it>> wrote:
>
> Hi Hari,
>
>     No, we are using the latest MVAPICH2-GDR version 2.2b. Any other
>     ideas/suggestions?
>
>     Regards,
> fabio
>
>
> On 04/18/2016 02:58 PM, Hari Subramoni wrote:
>>
>> Hello Fabio,
>>
>> Thanks for your note. It looks like you are using the regular
>> MVAPICH2 version on your GPU system. Many issues related to GPUs
>> and their various code paths (including those involving QPI)
>> have been resolved in the MVAPICH2-GDR version. This version also
>> has many other features for performance and scalability, and we
>> recommend it for GPU systems. Please use the latest MVAPICH2-GDR
>> 2.2b release and let us know if you see any issues.
>>
>> Regards,
>> Hari.
>>
>> On Apr 18, 2016 2:25 AM, "Sebastiano Fabio Schifano"
>> <schifano at fe.infn.it <mailto:schifano at fe.infn.it>> wrote:
>>
>> Hi,
>>
>>         we are experiencing some issues running our "Lattice
>>         Boltzmann" code on a machine with
>> - 2 CPU sockets E5-2630-v3 (Haswell class)
>> - 4 K80 per socket
>> - 1 IB card per socket
>>
>>         At each iteration the code updates the halo boundaries of the
>>         lattice portion allocated on each GPU.
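>>
>>         For illustration, here is a minimal sketch of such an exchange,
>>         assuming a CUDA-aware MPI so that device buffers are passed
>>         directly to MPI_Sendrecv; the function and buffer names below are
>>         just placeholders, not our actual code:
>>
>>         #include <mpi.h>
>>
>>         /* Exchange one halo of 'count' doubles with the left/right
>>          * neighbours; d_send_* / d_recv_* point to GPU (device) memory. */
>>         static void exchange_halos(double *d_send_left,  double *d_recv_left,
>>                                    double *d_send_right, double *d_recv_right,
>>                                    int count, int left, int right,
>>                                    MPI_Comm comm)
>>         {
>>             MPI_Sendrecv(d_send_left,  count, MPI_DOUBLE, left,  0,
>>                          d_recv_right, count, MPI_DOUBLE, right, 0,
>>                          comm, MPI_STATUS_IGNORE);
>>             MPI_Sendrecv(d_send_right, count, MPI_DOUBLE, right, 1,
>>                          d_recv_left,  count, MPI_DOUBLE, left,  1,
>>                          comm, MPI_STATUS_IGNORE);
>>         }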
>>
>> The issue we are facing is the following:
>>
>>         - running on two GPUs attached to the same CPU socket, the
>>         result is correct;
>>         - running on two GPUs each attached to a different CPU socket,
>>         the result is wrong.
>>
>> However, if we set MV2_USE_SHARED_MEM=0 the result is correct
>> in both cases.
>>
>>         Investigating the problem further, we found that it happens only
>>         for certain halo sizes:
>>
>>         - for halo sizes 8192 and 16384 (double values) the code fails,
>>         - for halo sizes 2048, 4096, 9216, 10240, 12288 (double values)
>>         the result is correct.
>>
>>         With LY being the halo size, the size of the MPI communication
>>         is 26*(LY+6)*8 bytes.
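>>
>>         (For the failing sizes, this formula gives 26*(8192+6)*8 =
>>         1,705,184 bytes and 26*(16384+6)*8 = 3,409,120 bytes.)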
>>
>> We are running CentOS 7.2, MVAPICH2-2.2 and we are NOT using
>> GDRCOPY.
>>
>>         Any idea how to investigate this problem further? Any
>>         suggestion is welcome.
>>
>> Best Regards
>> fabio
>>
--
------------------------------------------------------------------
Schifano Sebastiano Fabio
Department of Mathematics and Computer Science - University of Ferrara
c/o Polo Scientifico e Tecnologico, Edificio B stanza 208
via Saragat 1, I-44122 Ferrara (Italy)
Tel: +39 0532 97 4614
-------------------------------------------------------------------