[mvapich-discuss] Same nodes different time?

Daniel WEI lakeat at gmail.com
Thu Mar 27 21:37:00 EDT 2014


Yes, it is a scratch file system. But even if we set aside the reading and
loading of files, the time spent on each simulation step afterwards is still
different.
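
To make the comparison easier, here is a minimal sketch (not the actual solver
code; read_fields() and advance_one_step() are hypothetical placeholders) of
how the initial file reading and the per-step work could be timed separately
with MPI_Wtime, which reports elapsed wall-clock time:

#include <mpi.h>
#include <stdio.h>

static void read_fields(void)      { /* read velocity/pressure input here */ }
static void advance_one_step(void) { /* one simulation step here */ }

int main(int argc, char **argv)
{
    int rank;
    const int nsteps = 100;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    read_fields();
    double t_io = MPI_Wtime() - t0;

    for (int step = 0; step < nsteps; ++step) {
        double t1 = MPI_Wtime();
        advance_one_step();
        MPI_Barrier(MPI_COMM_WORLD);   /* let all ranks finish before timing the step */
        double t_step = MPI_Wtime() - t1;
        if (rank == 0)
            printf("step %3d: %.4f s (initial I/O: %.2f s)\n", step, t_step, t_io);
    }

    MPI_Finalize();
    return 0;
}

Logging every step separately should show whether the spread comes from a few
slow steps or is distributed evenly across the run.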





Zhigang Wei
----------------------
*University of Notre Dame*


On Thu, Mar 27, 2014 at 9:32 PM, Sourav Chakraborty
<chakraborty.52 at osu.edu> wrote:

> Hi Daniel,
>
>
> In my case, the "reading in" of the velocity field and pressure field
>> can occasionally be hugely different (37 seconds in one case, 3 seconds
>> in another case).
>>
>
> By that, did you mean reading the input file? If so, what filesystem are
> you reading from?
>
> Sourav Chakraborty
> The Ohio State University
>
>
> On Thu, Mar 27, 2014 at 8:56 PM, Daniel WEI <lakeat at gmail.com> wrote:
>
>> Measurement is implemented in my C++ code, using clock() from "time.h", for
>> example:
>>
>> #include <time.h>   /* clock() and CLOCKS_PER_SEC */
>> clock_t start, end;
>> double elapsed;
>>
>> start = clock();
>> ... /* Do the work. */
>> end = clock();
>> elapsed = ((double) (end - start)) / CLOCKS_PER_SEC;
>>
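
One thing worth noting about this snippet: clock() reports CPU time consumed
by the calling process, not elapsed wall-clock time, so it can differ from the
time seen on the wall clock, for example while the process is blocked on I/O.
A wall-clock version of the same measurement might look like this (a sketch,
assuming a POSIX system where CLOCK_MONOTONIC is available):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... do the work ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed wall-clock time: %.6f s\n", wall);
    return 0;
}

In an MPI code, MPI_Wtime() gives the same kind of elapsed-time measurement
with less boilerplate.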
>> I have tried both 5~20 minute jobs and 0.5~3 hour jobs, and they all show
>> differences. Let's call the first JOB-A and the latter JOB-B.
>> At first I was testing JOB-B, and I found differences even though the hosts
>> were the same (only the order of hosts was different). So today I started
>> testing the smaller job, JOB-A, and fixed the order of hosts by manually
>> creating a hostfile; even with the same order of hosts, the results are
>> still different.
>>
>> I don't understand what you meant by "warm up", "startup/wrapup", etc. In
>> my case, the "reading in" of the velocity field and pressure field can
>> occasionally be hugely different (37 seconds in one case, 3 seconds in
>> another case).
>>
>> I guess Tony's point makes sense, that the problem is in the switches, but
>> I am not sure.
>>
>>
>>
>>
>>
>> Zhigang Wei
>> ----------------------
>> *University of Notre Dame*
>>
>>
>> On Thu, Mar 27, 2014 at 7:58 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>
>>> On 03/27/2014 05:58 PM, Daniel WEI wrote:
>>>
>>>>
>>>> On Thu, Mar 27, 2014 at 5:45 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>>>
>>>>     So your performance can vary depending on what else is going on with
>>>>     the other nodes in the system
>>>>
>>>>
>>>> Thank you Tony. I see.
>>>>
>>>> (1) But how much variance? My results show some very disturbing
>>>> differences: in one case initializing takes 37 s, in another 5 s, and in
>>>> yet another 2 s.
>>>> (2) What can I (or anyone else) do to reduce this variance? (There are 16
>>>> cores per node, so nobody else should be using the nodes I was allocated;
>>>> this seems to be guaranteed.)
>>>> (3) My goal is to compare the Intel compiler's -O3 and -O2 options when
>>>> building my CFD code, in terms of speed, but if my performance varies
>>>> even for the same case and the same hosts, how can I trust my results
>>>> anymore?
>>>> Zhigang Wei
>>>> ----------------------
>>>> /University of Notre Dame/
>>>>
>>>>
>>> Hi Zhigang
>>>
>>> What time are you measuring?
>>> Wall time from the job scheduler for the whole job?
>>> Wall time for the application only (say with Unix time utility or
>>> MPI_Wtime)?
>>> Something else?
>>>
>>> Have you tried to run your test simulations for a longer time (several
>>> minutes, one hour perhaps, not just a few seconds)
>>> to see if the outcome shows less spread?
>>> Say, you could change the number of time steps to 100x
>>> or perhaps 10,000x what you are currently using,
>>> depending of course on the max walltime allowed by your cluster queue.
>>>
>>> My wild guess is that with short-lived simulations
>>> what may count most are the job or application
>>> startup and wrapup times, which may vary significantly in a cluster,
>>> especially in a big cluster, overwhelming and obscuring your program's
>>> execution time.
>>> Most MPI and benchmark implementations recommend
>>> that you "warm up" your own tests/benchmarks
>>> for a time long enough to reduce such startup/wrapup effects.
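
A generic sketch of that warm-up/repetition pattern (kernel() below is just a
placeholder for the work being measured, not anyone's actual code): run a few
untimed iterations first, then report several timed repetitions so that
start-up effects and outliers are easy to spot.

#include <mpi.h>
#include <stdio.h>

static void kernel(void) { /* the work being measured goes here */ }

int main(int argc, char **argv)
{
    const int warmup = 5, reps = 10, iters = 100;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < warmup; ++i)      /* untimed warm-up iterations */
        kernel();

    for (int r = 0; r < reps; ++r) {      /* several timed repetitions */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i)
            kernel();
        if (rank == 0)
            printf("rep %d: %.4f s\n", r, MPI_Wtime() - t0);
    }

    MPI_Finalize();
    return 0;
}
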
>>>
>>> My two cents,
>>> Gus Correa
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>

