[mvapich-discuss] Same nodes different time?

Daniel WEI lakeat at gmail.com
Thu Mar 27 21:53:01 EDT 2014


Hi Hari,


Thank you very much!

Below is a copy of a report I wrote today (just for my personal notes); it
has the details you want to know. To answer your other questions:
(1) File I/O is done only once: file reading happens once at the very
beginning, and file writing happens once at the end of the simulation. I
can account for both of these when I measure the time spent on each
simulation step.
(2) Yes, there is a difference in the time needed for each simulation step.
It is most noticeable in the 1st time step, where the difference can be
large; the later steps differ too, and the average difference is given
below.
(3) The code is OpenFOAM, an open-source CFD code. You can get it whenever
you want, but the compilation is very time consuming. :)


BEGIN------------------------------------------------------

I ran the same case five times.

Each time the hosts were (the number XX below stands for d8civy0XX.crc.nd.edu):

Simulation-1: 19 16 05 09 02 15 20 14 04 06 11 17 07 03 08 12 01 10 13 18

Simulation-2: 07 03 08 12 01 10 13 18 19 16 05 09 02 15 20 14 04 06 11 17

Simulation-3: 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20

Simulation-4: 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20

Simulation-5: 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20

As you can see, the computing nodes/hosts used are exactly the same (01
through 20, i.e. 20 nodes, 320 cores), but the order of hosts differs
between simulation-1 and simulation-2. Simulations 3-5 have the same order
of hosts (set manually via the hostfile for MVAPICH2). The hostfile used
is:

d8civy001.crc.nd.edu:16

d8civy002.crc.nd.edu:16

d8civy003.crc.nd.edu:16

d8civy004.crc.nd.edu:16

d8civy005.crc.nd.edu:16

d8civy006.crc.nd.edu:16

d8civy007.crc.nd.edu:16

d8civy008.crc.nd.edu:16

d8civy009.crc.nd.edu:16

d8civy010.crc.nd.edu:16

d8civy011.crc.nd.edu:16

d8civy012.crc.nd.edu:16

d8civy013.crc.nd.edu:16

d8civy014.crc.nd.edu:16

d8civy015.crc.nd.edu:16

d8civy016.crc.nd.edu:16

d8civy017.crc.nd.edu:16

d8civy018.crc.nd.edu:16

d8civy019.crc.nd.edu:16

d8civy020.crc.nd.edu:16

The case loading times are (all time measurements are implemented in the
C++ code):

0.59s CPU time and 1s wall time

1.50s CPU time and 3s wall time

35.7s CPU time and 37s wall time

2.29s CPU time and 5s wall time

0.72s CPU time and 2s wall time

So even with the same order of hosts, the case loading times differ. Node
01 is the host node; nodes 02~20 are all slave nodes. At first I guessed
that the memory of host node 01 might need to be fully cleaned up to get a
good loading time, or that node 01 was being used by another program at
the time without my knowledge. It is also not clear how the order of hosts
is set. Is the order of hosts generated randomly by MPI?

The wall-clock times at the simulation's 1st and 50th time steps are:

49s for the 1st step and 1067s for the 50th step (20.78s for each of the
last 49 steps)

35s for the 1st step and 1047s for the 50th step (20.65s for each of the
last 49 steps)

69s for the 1st step and 1082s for the 50th step (20.67s for each of the
last 49 steps)

36s for the 1st step and 1052s for the 50th step (20.73s for each of the
last 49 steps)

48s for the 1st step and 1082s for the 50th step (21.10s for each of the
last 49 steps)

Notice that even though the average time per step is roughly the same
across runs, a longer simulation could still accumulate a non-negligible
difference. What is disturbing is that the last three simulations, which
have the same order of hosts, still show differences: the averaged step
time differs by (21.10-20.67)/20.67 = 2.08% between simulation 3 and
simulation 5. The actual cost of the first time step (subtracting the
loading time) is 49-1=48s, 35-3=32s, 69-37=32s, 36-5=31s and 48-2=46s; so
except for the first and last simulations, they are roughly the same. The
good news is that all the simulation results are identical (checked with
vimdiff on the logs, which contain enough information on residuals,
time-step continuity errors, etc.); in other words, the final residuals
and time-step continuity errors after 50 time steps are the same at the
end of every run. Simulation 3 was killed as soon as its last time step
finished so that simulation 4 could start sooner. It is not clear whether
a fully clean exit would clean up the memory better, nor whether this is a
memory issue, an over-subscription issue, or a consequence of the network
not being flat. But since simulation 5 started only after simulation 4 had
exited cleanly, the unclean-exit theory is ruled out.

END------------------------------------------------------





Zhigang Wei
----------------------
*University of Notre Dame*


On Thu, Mar 27, 2014 at 9:44 PM, Hari Subramoni <subramoni.1 at osu.edu> wrote:

> Hello Daniel,
>
> I am moving this conversation from mvapich-discuss to our internal
> developer list. It will be easier to debug it further within the MVAPICH
> group. Thus I would appreciate it if you could reply to this chain.
>
> Do you read the file in each step? If you ignore the file reading part,
> how much does the time taken per step vary by? Do you still see the 37 to 3
> second difference per step?
>
> Do you have a sample version of your code that you can share with us?
>
> Regards,
> Hari.
>
>
> On Thu, Mar 27, 2014 at 9:37 PM, Daniel WEI <lakeat at gmail.com> wrote:
>
>> Yes, it is a scratch file system. But even if we set aside the reading
>> and loading of files, the time spent on each simulation step afterwards
>> is still different.
>>
>>
>>
>>
>>
>> Zhigang Wei
>> ----------------------
>> *University of Notre Dame*
>>
>>
>> On Thu, Mar 27, 2014 at 9:32 PM, Sourav Chakraborty <
>> chakraborty.52 at osu.edu> wrote:
>>
>>> Hi Daniel,
>>>
>>>
>>> In my case, the "reading in" of the velocity field and pressure field
>>>> sometimes could be occasionally huge different (37 seconds in one case, 3
>>>> seconds in another case).
>>>>
>>>
>>> Did you mean reading the input file by that? In that case, what
>>> filesystem are you reading from?
>>>
>>> Sourav Chakraborty
>>> The Ohio State University
>>>
>>>
>>> On Thu, Mar 27, 2014 at 8:56 PM, Daniel WEI <lakeat at gmail.com> wrote:
>>>
>>>> Measurement is implemented in my C++ code, using clock() from <ctime>,
>>>> for example:
>>>>
>>>> start = clock();
>>>> ... /* Do the work. */
>>>> end = clock();
>>>> /* Note: clock() measures CPU time consumed, not elapsed wall time. */
>>>> elapsed = ((double) (end - start)) / CLOCKS_PER_SEC;
>>>>
>>>> I have tried both 5~20 minute jobs and 0.5~3 hour jobs; they all show
>>>> differences. Call the first JOB-A and the latter JOB-B. At first I was
>>>> testing JOB-B, and I found differences even though the hosts were the
>>>> same (just the order of hosts was different). So today I started
>>>> testing a smaller job, JOB-A, and fixed the order of hosts by manually
>>>> creating a hostfile; even with the same order of hosts, the results
>>>> are still different.
>>>>
>>>> I don't understand what you meant by "warm up", "startup/wrapup", etc.
>>>> In my case, the "reading in" of the velocity field and pressure field
>>>> can occasionally differ hugely (37 seconds in one case, 3 seconds in
>>>> another).
>>>>
>>>> I guess Tony's point makes sense, that the problem is in the switches.
>>>> But I am not sure.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Zhigang Wei
>>>> ----------------------
>>>> *University of Notre Dame*
>>>>
>>>>
>>>> On Thu, Mar 27, 2014 at 7:58 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>>>
>>>>> On 03/27/2014 05:58 PM, Daniel WEI wrote:
>>>>>
>>>>>>
>>>>>> On Thu, Mar 27, 2014 at 5:45 PM, Tony Ladd <tladd at che.ufl.edu
>>>>>> <mailto:tladd at che.ufl.edu>> wrote:
>>>>>>
>>>>>>     So your performance can vary depending on what else is going on
>>>>>> with
>>>>>>     the other nodes in the system
>>>>>>
>>>>>>
>>>>>> Thank you Tony. I see.
>>>>>>
>>>>>> (1) But how much variance? My results show some very disturbing
>>>>>> differences: in one case, initializing the case takes 37s; in
>>>>>> another, 5s; in yet another, 2s!
>>>>>> (2) How can I, or anyone else, best reduce this variance? (There are
>>>>>> 16 cores/node, so nobody else should be using the nodes I was
>>>>>> allocated; this seems to be guaranteed.)
>>>>>> (3) My goal is to compare the Intel compiler's -O3 and -O2 builds of
>>>>>> my CFD code with respect to speed, but if my performance varies even
>>>>>> for the same case on the same hosts, how can I trust my results
>>>>>> anymore?
>>>>>> Zhigang Wei
>>>>>> ----------------------
>>>>>> /University of Notre Dame/
>>>>>>
>>>>>>
>>>>> Hi Zhigang
>>>>>
>>>>> What time are you measuring?
>>>>> Wall time from the job scheduler for the whole job?
>>>>> Wall time for the application only (say with Unix time utility or
>>>>> MPI_Wtime)?
>>>>> Something else?
>>>>>
>>>>> Have you tried to run your test simulations for a longer time (several
>>>>> minutes, one hour perhaps, not just a few seconds)
>>>>> to see if the outcome shows less spread?
>>>>> Say, you could change the number of time steps to 100x
>>>>> or perhaps 10,000x what you are currently using,
>>>>> depending of course on the max walltime allowed by your cluster queue.
>>>>>
>>>>> My wild guess is that with short-lived simulations
>>>>> what may count most is the job or application
>>>>> startup and wrapup times, which may vary significantly in a cluster,
>>>>> especially in a big cluster, overwhelming and obscuring your program
>>>>> execution time.
>>>>> Most MPI and benchmark implementations recommend
>>>>> that you "warm up" your own tests/benchmarks
>>>>> for a time long enough to reduce such startup/wrapup effects.
>>>>>
>>>>> My two cents,
>>>>> Gus Correa
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> mvapich-discuss mailing list
>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>

