[mvapich-discuss] (no subject)

Mingzhe Li li.2192 at osu.edu
Sat Jan 9 11:34:59 EST 2016


Hi Benedikt,

Thanks for your note. Did you allocate around 700 MB of shared memory with
MPI_Win_allocate_shared for each MPI process? If that's the case, the
memory consumption will be the same as using malloc for each MPI process.
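
If the goal is a single shared copy of the data per node, the usual
pattern is to request the full size on only one rank of a node-local
communicator and let the other ranks attach to that segment with
MPI_Win_shared_query. A minimal sketch in C (illustrative code, not
taken from your application; the size and names are placeholders):

    /* Allocate one ~700 MB shared segment per node on node rank 0 only;
       all other ranks ask for 0 bytes and attach to rank 0's segment. */
    #include <mpi.h>
    #include <stddef.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Communicator of the ranks that share memory on this node. */
        MPI_Comm nodecomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);

        int noderank;
        MPI_Comm_rank(nodecomm, &noderank);

        /* Only node rank 0 contributes the ~700 MB. */
        MPI_Aint nbytes = (noderank == 0) ? (MPI_Aint)700 * 1024 * 1024 : 0;

        double *base;
        MPI_Win win;
        MPI_Win_allocate_shared(nbytes, sizeof(double), MPI_INFO_NULL,
                                nodecomm, &base, &win);

        /* Every rank obtains a pointer to rank 0's segment. */
        MPI_Aint qsize;
        int qdisp;
        double *data;
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &data);

        /* ... node rank 0 fills data[], the others read it after a
           synchronization (e.g. MPI_Win_fence or MPI_Barrier) ... */

        MPI_Win_free(&win);
        MPI_Comm_free(&nodecomm);
        MPI_Finalize();
        return 0;
    }

Note that even with this pattern, top will still show the touched pages
of the single segment in every rank's RES, since Linux charges a shared
resident page to every process that maps and touches it; the saving is
real but is not visible in per-process RSS, which is exactly the effect
described below.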

Thanks,
Mingzhe
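
On the broader question of how a resource manager or the operating
system could account for this correctly: on Linux, every mapping listed
in /proc/<pid>/smaps carries a "Pss:" (proportional set size) field in
which each shared resident page is divided by the number of processes
mapping it. Summing PSS over all processes of a job therefore counts a
shared window only once in aggregate, unlike summing RSS. A rough
sketch (the helper name is made up for illustration):

    #include <stdio.h>

    /* Return the proportional set size of one process in kB,
       or -1 if /proc/<pid>/smaps cannot be read. */
    long pss_kb(long pid)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%ld/smaps", pid);

        FILE *f = fopen(path, "r");
        if (!f)
            return -1;

        char line[256];
        long total = 0, kb;
        while (fgets(line, sizeof(line), f)) {
            /* Each mapping has a line of the form "Pss:    1234 kB". */
            if (sscanf(line, "Pss: %ld", &kb) == 1)
                total += kb;
        }
        fclose(f);
        return total;
    }

Cgroup-based accounting behaves similarly: a page is charged to the
memory cgroup that first faults it in, so a per-job memory cgroup also
charges the shared window once rather than once per rank.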

On Fri, Jan 8, 2016 at 11:01 AM, Brandt, Benedikt B <benbra at gatech.edu>
wrote:

> Please excuse the terrible formatting of my last mail. This was the
> first time I submitted to this list. Here is a well-formatted
> version:
>
> Dear mvapich community
>
> I am currently testing the MPI-3 shared memory routines for use in our
> application. The goal is to reduce the memory footprint of our
> application per node.
>
> The code seems to work, but I get the following odd behavior when I
> monitor the memory usage:
>
> TLDR: Shared memory that is "touched" (read or written) by an MPI
> process counts towards that process's real memory (RSS, RES) value. If
> every process accesses the whole shared memory (= data), the memory
> consumption as seen by top (or other monitoring tools) is the same as
> if every process had its own copy of the data.
>
> If we run this job on a cluster with a job scheduler and resource
> manager, our jobs will be aborted if we expect the shared memory to
> count only once. So how can we work around this problem? How could a
> resource manager (or the operating system) correctly determine memory
> consumption?
>
> === Long version: ===
>
> Running our code compiled with mvapich (2.1) and ifort (15) on one
> node, I see the following memory footprint right after starting the
> program:
>
> PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
> 47708 bbrandt6  20   0  746m  14m 6064 R 100.0  0.0   0:22.57 exa
> 47707 bbrandt6  20   0  746m  14m 6164 R 100.0  0.0   0:22.56 exa
> 47709 bbrandt6  20   0  746m  14m 6020 R 100.0  0.0   0:22.58 exa
> 47710 bbrandt6  20   0  746m  14m 6056 R 100.0  0.0   0:22.55 exa
> 47711 bbrandt6  20   0  746m  14m 6072 R 100.0  0.0   0:22.57 exa
>
>
> This is as expected since we allocate about 700 MB of shared memory
> using MPI_Win_allocate_shared. After copying the data into the shared
> memory, it looks like this:
>
>
> PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
> 47711 bbrandt6  20   0  746m  17m 6216 R 100.0  0.0   3:01.03 exa
> 47708 bbrandt6  20   0  746m  17m 6212 R 99.6  0.0   2:40.07 exa
> 47707 bbrandt6  20   0  746m 612m 600m R 99.3  0.9   3:01.33 exa
> 47709 bbrandt6  20   0  746m  17m 6164 R 98.6  0.0   3:06.72 exa
> 47710 bbrandt6  20   0  746m  17m 6200 R 98.6  0.0   2:43.91 exa
>
> Again, just as expected: one process copied the data and now has a
> memory footprint of 746m VIRT and 612m RES. Now the other processes
> start accessing the data, and we get:
>
> PID   USER      PR  NI  VIRT  RES  SHR S %CPU  %MEM   TIME+  COMMAND
> 47709 bbrandt6  20   0  785m 214m 165m R 100.0  0.3   3:49.37 exa
> 47707 bbrandt6  20   0  785m 653m 602m R 100.0  1.0   3:43.93 exa
> 47708 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:23.03 exa
> 47710 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:26.86 exa
> 47711 bbrandt6  20   0  785m 214m 166m R 100.0  0.3   3:44.01 exa
>
> which increases to 787m VIRT and 653m RES for all processes once they
> have accessed all the data in the shared memory. So the memory footprint
> is just as large as if every process held its own copy of the data. At
> this point it seems like we haven't saved any memory at all. We may
> have gained speed and bandwidth, but using the shared memory did not
> reduce the memory footprint of our application.
>
> If we run this job on a cluster with a job scheduler and resource
> manager, our jobs will be aborted if we expect the shared memory to
> count only once. So how can we work around this problem? Is the cause
> of this problem that mvapich runs separate processes, so the shared
> memory counts fully towards each of them, whereas OpenMP runs only one
> process with multiple threads, so the shared memory counts only once?
> How could a resource manager (or the operating system) correctly
> determine memory consumption?
>
> === end long version ===
>
> Any thoughts or comments are truly appreciated.
>
> Thanks a lot
>
> Benedikt
>
> ________________________________________
> From: Brandt, Benedikt B <benbra at gatech.edu>
> Sent: Friday, January 8, 2016 10:46 AM
> To: mvapich-discuss at cse.ohio-state.edu
> Subject:
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


More information about the mvapich-discuss mailing list