[mvapich-discuss] code fails running on GPUs attached to different sockets

Hari Subramoni subramoni.1 at osu.edu
Mon Apr 18 08:58:50 EDT 2016


Hello Fabio,

Thanks for your note. It looks like you are using the regular MVAPICH2
version on your GPU system. Many issues related to GPUs and their various
code paths (including those that cross QPI) have been resolved in the
MVAPICH2-GDR version. That version also has many other features for
performance and scalability, and we recommend it for GPU systems. Please
use the latest MVAPICH2-GDR 2.2b release and let us know if you see any
issues.

Regards,
Hari.

On Apr 18, 2016 2:25 AM, "Sebastiano Fabio Schifano" <schifano at fe.infn.it>
wrote:

> Hi,
>
> we are experiencing some issues running our "Lattice Boltzmann" code
> on a machine with
> - 2 CPU sockets E5-2630-v3 (Haswell class)
> - 4 K80 per socket
> - 1 IB card per socket
>
> At each iteration, the code updates the halo boundaries of the lattice
> portion allocated on each GPU.
>
> The issue we are facing is the following:
>
> - running on two GPUs attached to the same CPU socket, the result is
> correct
> - running on two GPUs each attached to a different CPU socket, the result
> is wrong
>
> However, if we set MV2_USE_SHARED_MEM=0 the result is correct in both
> cases.
>
> Investigating the problem further, we found that it happens only for
> certain halo sizes:
>
> - for halo sizes 8192 and 16384 (double values) the code fails
> - for halo sizes 2048, 4096, 9216, 10240, 12288 (double values) the result
> is correct
>
> With LY being the size of the halo, the size of each MPI communication is
> 26*(LY+6)*8 bytes.
>
> We are running CentOS 7.2, MVAPICH2-2.2 and we are NOT using GDRCOPY.
>
> Any ideas on how to investigate this problem further? Any suggestion is
> welcome.
>
> Best Regards
> fabio
>
>
> --
>   ------------------------------------------------------------------
>   Schifano Sebastiano Fabio
>   Department of Mathematics and Computer Science - University of Ferrara
>   c/o Polo Scientifico e Tecnologico, Edificio B stanza 208
>   via Saragat 1, I-44122 Ferrara (Italy)
>   Tel: +39 0532 97 4614
>   -------------------------------------------------------------------
>
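For reference, the message sizes implied by the formula in the quoted message, 26*(LY+6)*8 bytes, can be tabulated with a small script. This is only a sketch of the arithmetic: the helper name halo_message_bytes is hypothetical and not part of the original code, and the 8 is taken to be the size of a double, as the message suggests.

```python
# Hypothetical helper, for illustration only: message size in bytes from the
# formula quoted in the message, 26*(LY+6)*8, assuming 8 bytes per double.
def halo_message_bytes(ly: int) -> int:
    return 26 * (ly + 6) * 8

# Halo sizes reported in the message, grouped by observed outcome.
reported = {
    "fails": [8192, 16384],
    "correct": [2048, 4096, 9216, 10240, 12288],
}

for outcome, sizes in reported.items():
    for ly in sizes:
        nbytes = halo_message_bytes(ly)
        mib = nbytes / 2**20
        print(f"LY={ly:6d}  {nbytes:9d} bytes  ({mib:.2f} MiB)  {outcome}")
```

With these numbers, the failing size LY=8192 (1705184 bytes) falls between the passing sizes LY=4096 (853216 bytes) and LY=9216 (1918176 bytes), so the failures do not appear to be a simple function of message size alone.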