[mvapich-discuss] Shared memory in MPI3 - measure of memory footprint

Brandt, Benedikt B benbra at gatech.edu
Sat Jan 9 17:13:27 EST 2016


Hi Mingzhe,

Thanks a lot for your reply! Yes, I did execute MPI_Win_allocate_shared
on each MPI process, and the shared memory is about 700 MB in total.
I have attached the relevant code snippet at the end of this mail.
From the documentation I read that MPI_Win_allocate_shared is a
collective call, so I do have to call it from every process that is
supposed to use the shared memory, right?

I guess the question I am truly asking is: Is there a way to use MPI
shared memory so that it is accessible by all processes but counted
only once (like threads in OpenMP)?

Thanks a lot

Benedikt

===== Code sample below ======


    ! Context assumed by this snippet (the declarations are not part of the
    ! original mail and belong in the enclosing program unit):
    !   use mpi
    !   use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
    !   integer :: hostcomm, hostrank, win, win2, disp_unit, ierr, nroot1
    !   integer(kind=MPI_ADDRESS_KIND) :: windowsize
    !   type(c_ptr) :: baseptr, baseptr2
    !   integer, allocatable :: arrayshape(:)
    !   real(8), pointer :: matrix_elementsy(:,:,:,:), matrix_elementsz(:,:,:,:)

    ! Group the ranks that share this node into one communicator
    CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, &
        0, MPI_INFO_NULL, hostcomm, ierr)
    CALL MPI_Comm_rank(hostcomm, hostrank, ierr)

    allocate(arrayshape(4))
    arrayshape = (/ nroot1, nroot1, nroot1, nroot1 /)
    if (hostrank == 0) then
        ! Only rank 0 requests a non-zero window; *8 since there are
        ! 8 bytes in a double
        windowsize = int(nroot1**4, MPI_ADDRESS_KIND) * 8_MPI_ADDRESS_KIND
    else
        windowsize = 0_MPI_ADDRESS_KIND
    end if
    disp_unit = 1

    ! Collective call: every rank of hostcomm participates in the allocation
    CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, &
        hostcomm, baseptr, win, ierr)
    CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, &
        hostcomm, baseptr2, win2, ierr)

    ! The other ranks obtain the location of rank 0's memory segment
    if (hostrank /= 0) then
        CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, &
            baseptr, ierr)
        CALL MPI_Win_shared_query(win2, 0, windowsize, disp_unit, &
            baseptr2, ierr)
    end if

    ! baseptr can now be associated with a Fortran pointer
    ! and thus used to access the shared data
    CALL C_F_POINTER(baseptr, matrix_elementsy, arrayshape)
    CALL C_F_POINTER(baseptr2, matrix_elementsz, arrayshape)
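
One detail the snippet above leaves out is the synchronization between
rank 0 filling the window and the other ranks reading it. A minimal,
hedged sketch of one way to do this (reusing the names from the snippet;
MPI_Win_fence is one of several valid window synchronization calls here):

    CALL MPI_Win_fence(0, win, ierr)
    if (hostrank == 0) then
        ! rank 0 fills the shared array (placeholder fill for illustration)
        matrix_elementsy = 0.0d0
    end if
    CALL MPI_Win_fence(0, win, ierr)
    ! after the second fence, every rank on the node may safely read
    ! matrix_elementsy through its own pointer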


________________________________
From: mingzhe0908 at gmail.com <mingzhe0908 at gmail.com> on behalf of Mingzhe Li <li.2192 at osu.edu>
Sent: Saturday, January 9, 2016 11:34 AM
To: Brandt, Benedikt B
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re:

Hi Benedikt,

Thanks for your note. Did you allocate around 700 MB of shared memory with MPI_Win_allocate_shared for each MPI process? If that's the case, the memory consumption will be the same as using malloc for each MPI process.

Thanks,
Mingzhe

On Fri, Jan 8, 2016 at 11:01 AM, Brandt, Benedikt B <benbra at gatech.edu> wrote:
Please excuse the terrible formatting of my last mail. This was the
first time I submitted to this list. Here is a well-formatted
version:

Dear mvapich community

I am currently testing the MPI-3 shared memory routines for use in our
application. The goal is to reduce the memory footprint of our
application per node.

The code seems to work but I get the following odd behavior when I
monitor the memory usage:

TLDR: Shared memory that is "touched" (read or written) by an MPI
process counts towards that process's real memory (RSS, RES) value. If
every process accesses the whole shared memory (= data), the memory
consumption as seen by top (or other monitoring tools) is the same as
if every process had its own copy of the data.

If we run this job on a cluster with a job scheduler and resource
manager, our jobs will be aborted if we request memory expecting the
shared memory to count only once. So how can we work around this
problem? How could a resource manager (or the operating system)
correctly determine memory consumption?

=== Long version: ===

Running our code compiled with mvapich (2.1) and ifort (15) on one
node, I see the following memory footprint right after starting the
program:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
47708 bbrandt6  20   0  746m  14m 6064 R 100.0  0.0   0:22.57 exa
47707 bbrandt6  20   0  746m  14m 6164 R 100.0  0.0   0:22.56 exa
47709 bbrandt6  20   0  746m  14m 6020 R 100.0  0.0   0:22.58 exa
47710 bbrandt6  20   0  746m  14m 6056 R 100.0  0.0   0:22.55 exa
47711 bbrandt6  20   0  746m  14m 6072 R 100.0  0.0   0:22.57 exa


This is as expected, since we allocate about 700 MB of shared memory
using MPI_Win_allocate_shared. After copying the data into the shared
memory, it looks like this:


PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
47711 bbrandt6  20   0  746m  17m 6216 R 100.0  0.0   3:01.03 exa
47708 bbrandt6  20   0  746m  17m 6212 R 99.6  0.0   2:40.07 exa
47707 bbrandt6  20   0  746m 612m 600m R 99.3  0.9   3:01.33 exa
47709 bbrandt6  20   0  746m  17m 6164 R 98.6  0.0   3:06.72 exa
47710 bbrandt6  20   0  746m  17m 6200 R 98.6  0.0   2:43.91 exa

Again just as expected: one process copied the data and now has a
memory footprint of 746m VIRT and 612m RES. Then the other processes
start accessing the data and we get:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
47709 bbrandt6  20   0  785m 214m 165m R 100.0  0.3   3:49.37 exa
47707 bbrandt6  20   0  785m 653m 602m R 100.0  1.0   3:43.93 exa
47708 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:23.03 exa
47710 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:26.86 exa
47711 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:44.01 exa

which increases to 787m VIRT and 653m RES for all processes once they
have accessed all the data in the shared memory. The memory footprint
is therefore just as large as if every process held its own copy of
the data, so at this point it seems like we haven't saved any memory
at all. We might have gained speed and bandwidth, but using the shared
memory did not reduce the memory footprint of our application.
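
(For reference, top's per-process RES double counts the shared pages:
each process is charged the full size of every shared page it has
touched. Taking the numbers above at face value, and assuming the
~600m SHR column is essentially the shared window, the node-level
physical usage is roughly

    5 x (653 MB - 600 MB) + 600 MB  =  865 MB   (shared pages counted once)

rather than the naive sum 5 x 653 MB = 3265 MB that the RES column
suggests.)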

If we run this job on a cluster with a job scheduler and resource
manager, our jobs will be aborted if we request memory expecting the
shared memory to count only once. So how can we work around this
problem? Is the cause of this problem that mvapich runs separate
processes, so the shared memory counts fully towards each of them,
whereas OpenMP runs only one process with multiple threads, so the
shared memory counts only once? How could a resource manager (or the
operating system) correctly determine memory consumption?
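
For what it's worth, Linux does expose a "proportional set size" (Pss)
per mapping in /proc/<pid>/smaps, which charges each shared page only
1/N to each of the N processes mapping it; summing Pss across the ranks
on a node therefore does not double count the window. A minimal sketch
(not part of our application) that each rank could run on itself:

    program pss_self
        ! Sum the Pss lines of /proc/self/smaps and report the total in kB.
        implicit none
        integer :: ios, u
        integer(kind=8) :: kb, pss_kb
        character(len=256) :: line

        pss_kb = 0
        open(newunit=u, file='/proc/self/smaps', action='read', &
             status='old', iostat=ios)
        if (ios /= 0) stop 'could not open /proc/self/smaps'
        do
            read(u, '(A)', iostat=ios) line
            if (ios /= 0) exit
            if (line(1:4) == 'Pss:') then
                read(line(5:), *) kb      ! smaps reports sizes in kB
                pss_kb = pss_kb + kb
            end if
        end do
        close(u)
        print '(A,I0,A)', 'Pss total: ', pss_kb, ' kB'
    end program pss_self

Summed over the five ranks, this should come out near the actual
physical usage of the node rather than the much larger sum of the RES
values.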

=== end long version ===

Any thoughts and comments are truly appreciated.

Thanks a lot

Benedikt

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


