[mvapich-discuss] Shared memory in MPI3 - measure of memory footprint

Brandt, Benedikt B benbra at gatech.edu
Fri Jan 8 10:46:23 EST 2016


Dear mvapich community


I am currently testing the MPI-3 shared memory routines for use in our application. The goal is to reduce the memory footprint of our application per node.


The code seems to work, but I see the following odd behavior when I monitor its memory usage:


TLDR: Shared memory that is "touched" (read or written) by an MPI process counts towards that process's resident memory (RSS/RES). If every process accesses the whole shared region (i.e., all of the data), the memory consumption reported by top (or other monitoring tools) is the same as if every process had its own copy of the data.
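
As far as I understand, this is just demand paging at work: a shared mapping only gets counted in a process's RSS once that process actually touches the pages. A minimal single-process illustration in C (plain mmap, nothing from our application; the 100 MB size is arbitrary):

/* Map 100 MB of shared anonymous memory and print VmRSS before and
 * after touching it. The mapping enters RSS only on first touch. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void print_rss(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);
    fclose(f);
}

int main(void)
{
    size_t len = 100UL * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    print_rss("before touch:");
    memset(p, 1, len);               /* first touch maps the pages */
    print_rss("after touch: ");
    munmap(p, len);
    return 0;
}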


If we run this job on a cluster with a job scheduler and resource manager, our jobs will be aborted if we size our memory request on the assumption that the shared memory counts only once. So how can we work around this problem? How could a resource manager (or the operating system) correctly determine memory consumption?


=== Long version: ===


Running our code (compiled with mvapich 2.1 and ifort 15) on one node, I see the following memory footprint right after starting the program:


  PID USER       PR  NI  VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
47708 bbrandt6   20   0  746m   14m  6064 R 100.0  0.0  0:22.57 exact_ddot_ene_
47707 bbrandt6   20   0  746m   14m  6164 R 100.0  0.0  0:22.56 exact_ddot_ene_
47709 bbrandt6   20   0  746m   14m  6020 R 100.0  0.0  0:22.58 exact_ddot_ene_
47710 bbrandt6   20   0  746m   14m  6056 R 100.0  0.0  0:22.55 exact_ddot_ene_
47711 bbrandt6   20   0  746m   14m  6072 R 100.0  0.0  0:22.57 exact_ddot_ene_


This is as expected, since we allocate about 700 MB of shared memory using MPI_Win_allocate_shared; the still-untouched allocation shows up in VIRT but barely in RES (a rough sketch of how we allocate the window follows the next listing). After copying the data into the shared memory, it looks like this:


  PID USER       PR  NI  VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
47711 bbrandt6   20   0  746m   17m  6216 R 100.0  0.0  3:01.03 exact_ddot_ene_
47708 bbrandt6   20   0  746m   17m  6212 R  99.6  0.0  2:40.07 exact_ddot_ene_
47707 bbrandt6   20   0  746m  612m  600m R  99.3  0.9  3:01.33 exact_ddot_ene_
47709 bbrandt6   20   0  746m   17m  6164 R  98.6  0.0  3:06.72 exact_ddot_ene_
47710 bbrandt6   20   0  746m   17m  6200 R  98.6  0.0  2:43.91 exact_ddot_ene_
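
For reference, the allocation is done roughly as in the following sketch (written in C for illustration; our actual code is Fortran, and the names and the 700 MB size here are made up): rank 0 on each node allocates the full window, all other ranks allocate zero bytes and query rank 0's base pointer.

/* Illustrative sketch only, not our production code. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm nodecomm;
    MPI_Win  win;
    MPI_Aint size;
    int      nrank, disp_unit;
    double  *base;

    MPI_Init(&argc, &argv);

    /* communicator containing only the ranks on this node */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &nrank);

    /* one allocation of ~700 MB per node, owned by node rank 0 */
    size = (nrank == 0) ? (MPI_Aint)700 * 1024 * 1024 : 0;
    MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                            nodecomm, &base, &win);

    /* the non-allocating ranks get a pointer into rank 0's memory */
    if (nrank != 0)
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &base);

    /* ... rank 0 fills the window, then every rank reads it via 'base' ... */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}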

Again, just as expected: one process copied the data and now has a memory footprint of 746m VIRT and 612m RES. Now the other processes start accessing the data, and we get:

  PID USER       PR  NI  VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
47709 bbrandt6   20   0  785m  214m  165m R 100.0  0.3  3:49.37 exact_ddot_ene_
47707 bbrandt6   20   0  785m  653m  602m R 100.0  1.0  3:43.93 exact_ddot_ene_
47708 bbrandt6   20   0  785m  214m  166m R 100.0  0.3  3:23.03 exact_ddot_ene_
47710 bbrandt6   20   0  785m  214m  166m R 100.0  0.3  3:26.86 exact_ddot_ene_
47711 bbrandt6   20   0  785m  214m  166m R 100.0  0.3  3:44.01 exact_ddot_ene_

which increases to 787m VIRT and 653m RES for all processes once they have accessed all the data in the shared memory. The memory footprint is therefore just as large as if every process held its own copy of the data, so at this point it seems we have not saved any memory at all. We may have gained speed and bandwidth, but using shared memory did not reduce the memory footprint of our application.

If we run this job on a cluster with a job scheduler and resource manager, our jobs will be aborted if we size our memory request on the assumption that the shared memory counts only once. So how can we work around this problem? Is the cause of the problem that mvapich runs separate processes, so the shared memory counts fully towards each of them, whereas OpenMP runs a single process with multiple threads, so it counts only once? How could a resource manager (or the operating system) correctly determine memory consumption?
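
One idea, in case it helps the discussion (just a sketch, nothing we rely on): Linux exposes a Pss ("proportional set size") value per mapping in /proc/<pid>/smaps, which charges each shared page 1/N-th to each of the N processes that map it. Summing Pss instead of RSS over the ranks on a node should therefore count the shared window only once:

/* Minimal sketch: sum the Pss fields of /proc/<pid>/smaps for each PID
 * given on the command line. Pss divides shared pages among the
 * processes mapping them, unlike RSS. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long pss_kb(const char *pid)
{
    char path[64], line[256];
    long total = 0;
    snprintf(path, sizeof path, "/proc/%s/smaps", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "Pss:", 4) == 0)
            total += atol(line + 4);   /* value is reported in kB */
    fclose(f);
    return total;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("PID %s  Pss = %ld kB\n", argv[i], pss_kb(argv[i]));
    return 0;
}

Running this with the five PIDs from the top output above should add up to roughly one copy of the 700 MB window plus the private memory of each rank.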

=== end long version ===

Any thoughts and comments are truly appreciated.

Thanks a lot

Benedikt




