[mvapich-discuss] Shared memory in MPI3 - measure of memory footprint

Brandt, Benedikt B benbra at gatech.edu
Sun Jan 10 19:28:06 EST 2016


Hi Mingzhe

Yes, that's possible for our application, and we are doing this currently:

    if (hostrank == 0) then
        windowsize = int(nroot1**4, MPI_ADDRESS_KIND) * &
            8_MPI_ADDRESS_KIND   ! *8 since there are 8 bytes in a double
    else
        windowsize = 0_MPI_ADDRESS_KIND
    end if

For a more complete listing, see: http://pastebin.com/0bGBYqbE

So the windowsize is 0 for all MPI processes on a node except the one with hostrank == 0. I tried this with both impi 5.1 and MVAPICH 2.1; both show the behavior mentioned earlier.
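In case the pastebin listing goes stale, here is a minimal, self-contained sketch of the same pattern (a sketch only: the program and variable names below are placeholders, not the ones from our actual code). Only hostrank 0 passes a nonzero size to MPI_Win_allocate_shared; every other rank passes 0 and attaches to rank 0's segment via MPI_Win_shared_query:

    program shared_window_sketch
        use mpi
        use iso_c_binding, only: c_ptr, c_f_pointer
        implicit none
        integer, parameter :: nelems = 1000        ! placeholder size
        integer :: hostcomm, hostrank, win, disp_unit, ierr
        integer(kind=MPI_ADDRESS_KIND) :: windowsize
        type(c_ptr) :: baseptr
        real(8), pointer :: shared(:)

        call MPI_Init(ierr)
        call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, &
            0, MPI_INFO_NULL, hostcomm, ierr)
        call MPI_Comm_rank(hostcomm, hostrank, ierr)

        if (hostrank == 0) then
            windowsize = int(nelems, MPI_ADDRESS_KIND) * 8_MPI_ADDRESS_KIND
        else
            windowsize = 0_MPI_ADDRESS_KIND        ! no backing memory on other ranks
        end if
        disp_unit = 1

        ! Collective call: every rank participates, but only rank 0 contributes memory.
        call MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, &
            hostcomm, baseptr, win, ierr)
        if (hostrank /= 0) then
            ! Map rank 0's segment into this rank's address space.
            call MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
        end if
        call c_f_pointer(baseptr, shared, [nelems])

        call MPI_Win_fence(0, win, ierr)
        if (hostrank == 0) shared = 1.0d0          ! a single copy of the data
        call MPI_Win_fence(0, win, ierr)           ! make the stores visible to all ranks

        call MPI_Win_free(win, ierr)
        call MPI_Finalize(ierr)
    end program shared_window_sketch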

Again I really appreciate all comments and ideas.

Thanks

Benedikt





From: mingzhe0908 at gmail.com <mingzhe0908 at gmail.com> on behalf of Mingzhe Li <li.2192 at osu.edu>
Sent: Saturday, January 9, 2016 7:30 PM
To: Brandt, Benedikt B
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re:
  

Hi Benedikt,


You are welcome. The window size can be different for each process. If 700 MB of shared memory is all you need per node, you could specify a window size of 0 for some MPI processes. Is this possible for your application?


Thanks,
Mingzhe 


On Sat, Jan 9, 2016 at 5:13 PM, Brandt, Benedikt B  <benbra at gatech.edu> wrote:

Hi Mingzhe

Thanks a lot for your reply! Yes, I did execute MPI_Win_allocate_shared
on each MPI process, and the shared memory is about 700 MB.
I have attached the relevant code snippet at the
end of this mail. From the documentation I read that
MPI_Win_allocate_shared is a collective call, so I do have to call it
from every process that is supposed to use the shared memory,
right?

I guess the question I am truly asking is: Is there a way to use MPI
shared memory so that the shared memory is accessible by all
processes but counted only once (like threads in OpenMP)?

Thanks a lot

Benedikt

===== Code sample below =====


    CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, &
        0, MPI_INFO_NULL, hostcomm, ierr)
    CALL MPI_Comm_rank(hostcomm, hostrank, ierr)

    allocate(arrayshape(4))
    arrayshape = (/ nroot1, nroot1, nroot1, nroot1 /)
    if (hostrank == 0) then
        windowsize = int(nroot1**4, MPI_ADDRESS_KIND) * &
            8_MPI_ADDRESS_KIND   ! *8 since there are 8 bytes in a double
    else
        windowsize = 0_MPI_ADDRESS_KIND
    end if
    disp_unit = 1

    CALL MPI_Win_allocate_shared(windowsize, disp_unit, &
        MPI_INFO_NULL, hostcomm, baseptr, win, ierr)
    CALL MPI_Win_allocate_shared(windowsize, disp_unit, &
        MPI_INFO_NULL, hostcomm, baseptr2, win2, ierr)

    ! Obtain the location of the memory segment
    if (hostrank /= 0) then
        CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, &
            baseptr, ierr)
        CALL MPI_Win_shared_query(win2, 0, windowsize, disp_unit, &
            baseptr2, ierr)
    end if

    ! baseptr can now be associated with a Fortran pointer
    ! and thus used to access the shared data
    CALL C_F_POINTER(baseptr, matrix_elementsy, arrayshape)
    CALL C_F_POINTER(baseptr2, matrix_elementsz, arrayshape)
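For completeness, a possible continuation of this snippet (illustrative only, reusing the variables declared above; the fill value is not from our real code) that synchronizes the ranks before they read the shared data and releases the windows at the end:

    ! Open an access epoch, let rank 0 fill the shared arrays, then close it
    ! so the stores become visible to every rank on the node.
    CALL MPI_Win_fence(0, win, ierr)
    CALL MPI_Win_fence(0, win2, ierr)
    if (hostrank == 0) matrix_elementsy = 0.0d0   ! e.g. rank 0 initializes the data
    CALL MPI_Win_fence(0, win, ierr)
    CALL MPI_Win_fence(0, win2, ierr)

    ! ... and free the windows (and the shared memory behind them) when done.
    CALL MPI_Win_free(win, ierr)
    CALL MPI_Win_free(win2, ierr)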


________________________________
From: mingzhe0908 at gmail.com <mingzhe0908 at gmail.com> on behalf of Mingzhe Li <li.2192 at osu.edu>
Sent: Saturday, January 9, 2016 11:34 AM
To: Brandt, Benedikt B
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re:

Hi Benedikt,

Thanks for your note. Did you allocate around 700 MB of shared memory with
MPI_Win_allocate_shared for each MPI process? If that's the case, the memory
consumption will be the same as using malloc for each MPI process.

Thanks,
Mingzhe

On Fri, Jan 8, 2016 at 11:01 AM, Brandt, Benedikt B <benbra at gatech.edu> wrote:
Please excuse the terrible formatting of my last mail. This was the
first time I submitted to this list. Here is a well-formatted
version:

Dear mvapich community

I am currently testing the MPI-3 shared memory routines for use in our
application. The goal is to reduce the memory footprint of our
application per node.

The code seems to work but I get the following odd behavior when I
monitor the memory usage:

TLDR: Shared memory that is "touched" (read or written) by an MPI
process counts towards that process's real memory (RSS, RES) value. If
every process accesses the whole shared memory (= the data), the memory
consumption as seen by top (or other monitoring tools) is the same as
if every process had its own copy of the data.

If we run this job on a cluster with a job scheduler and resource
manager, our jobs will be aborted if we request memory assuming the
shared segment counts only once. So how can we work around this
problem? How could a resource manager (or the operating system)
correctly determine memory consumption?
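One measure that would attribute the shared pages fairly is Linux's proportional set size (PSS), reported per mapping in /proc/<pid>/smaps: each shared page is divided by the number of processes mapping it, so summing PSS over all processes of a job counts the 700 MB segment only once, whereas summing RSS counts it once per process. A rough sketch (assuming a Linux /proc filesystem; this is not part of our application) that lets each process report its own PSS:

    program report_pss
        implicit none
        character(len=256) :: line
        integer :: u, ios
        integer(kind=8) :: pss_kb, total_kb

        total_kb = 0
        open(newunit=u, file='/proc/self/smaps', action='read', &
             status='old', iostat=ios)
        if (ios /= 0) stop 'cannot open /proc/self/smaps'
        do
            read(u, '(A)', iostat=ios) line
            if (ios /= 0) exit                    ! end of file
            if (line(1:4) == 'Pss:') then         ! one Pss line per mapping
                read(line(5:), *) pss_kb          ! value is given in kB
                total_kb = total_kb + pss_kb
            end if
        end do
        close(u)
        print *, 'proportional set size (kB): ', total_kb
    end program report_pss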

=== Long version: ===

Running our code compiled with mvapich (2.1) and ifort (15) on one
node, I see the following memory footprint right after starting the
program:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
47708 bbrandt6  20   0  746m  14m 6064 R 100.0  0.0   0:22.57 exa
47707 bbrandt6  20   0  746m  14m 6164 R 100.0  0.0   0:22.56 exa
47709 bbrandt6  20   0  746m  14m 6020 R 100.0  0.0   0:22.58 exa
47710 bbrandt6  20   0  746m  14m 6056 R 100.0  0.0   0:22.55 exa
47711 bbrandt6  20   0  746m  14m 6072 R 100.0  0.0   0:22.57 exa


This is as expected, since we allocate about 700 MB of shared memory
using MPI_Win_allocate_shared. After copying the data into the shared
memory, it looks like this:


PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
47711 bbrandt6  20   0  746m  17m 6216 R 100.0  0.0   3:01.03 exa
47708 bbrandt6  20   0  746m  17m 6212 R 99.6  0.0   2:40.07 exa
47707 bbrandt6  20   0  746m 612m 600m R 99.3  0.9   3:01.33 exa
47709 bbrandt6  20   0  746m  17m 6164 R 98.6  0.0   3:06.72 exa
47710 bbrandt6  20   0  746m  17m 6200 R 98.6  0.0   2:43.91 exa

Again just as expected: one process copied the data and now has a
memory footprint of 746m VIRT and 612m RES. Now the other processes
start accessing the data and we get:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
47709 bbrandt6  20   0  785m 214m 165m R 100.0  0.3   3:49.37 exa
47707 bbrandt6  20   0  785m 653m 602m R 100.0  1.0   3:43.93 exa
47708 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:23.03 exa
47710 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:26.86 exa
47711 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:44.01 exa

which increases to 787m VIRT and 653m RES for all processes once they
have accessed all the data in the shared memory. So the memory
footprint is just as large as if every process held its own copy of the
data. At this point it seems like we haven't saved any memory at all.
We might have gained speed and bandwidth, but using the shared memory
did not reduce the memory footprint of our application.

If we run this job on a cluster with a job scheduler and resource
manager, our jobs will be aborted if we request memory assuming the
shared segment counts only once. So how can we work around this
problem? Is the cause of this problem that MVAPICH runs separate
processes, so the shared memory counts fully towards each of them,
whereas OpenMP runs only one process with multiple threads, so the
shared memory counts only once? How could a resource manager (or the
operating system) correctly determine memory consumption?

=== end long version ===

Any thoughts and any comments are truly appreciated

Thanks a lot

Benedikt



_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss



More information about the mvapich-discuss mailing list