[mvapich-discuss] Same nodes different time?

Gus Correa gus at ldeo.columbia.edu
Thu Mar 27 21:53:13 EDT 2014


On 03/27/2014 08:56 PM, Daniel WEI wrote:
> Measurement is implemented in my c++ code, using "sys/times.h", for example:
>
> start = clock();
> ... /* Do the work. */
> end = clock();
> elapsed = ((double) (end - start)) / CLOCKS_PER_SEC;
>

In MPI programs this is in general done with MPI_Wtime, which measures
wall-clock (elapsed) time, whereas clock() measures the CPU time used by
the process:

double start, end, elapsed;
...
start = MPI_Wtime();
... /* work */ ...
end = MPI_Wtime();
elapsed = end - start;   /* seconds */
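
If it helps, here is a slightly fuller sketch (the MPI_Barrier and the
MPI_Reduce that takes the maximum over ranks are just one common way to
get a single, meaningful number per run, not something your code is
required to do):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double start, end, local, slowest;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Line up all ranks before starting the clock. */
    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();

    /* ... do the work ... */

    end = MPI_Wtime();
    local = end - start;

    /* The slowest rank is what determines the job's wall time. */
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("work section: %.3f s (max over ranks)\n", slowest);

    MPI_Finalize();
    return 0;
}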

> I have tried both 5~20 minute jobs and 0.5~3 hour jobs; they all
> show differences. Let's say the first is JOB-A and the latter is JOB-B.
> At first I was testing JOB-B, and I found there is a difference
> even though the hosts are the same (just the order of the hosts is
> different). So I then started to test a smaller job, that is JOB-A
> today, and I fixed the order of hosts by manually creating a hostfile, and
> then I found that even with the same order of hosts, the results are still
> different.

How about the relative variation?
Say, (max_job_time - min_job_time)/average_job_time, perhaps expressed 
as percent.
Does it stay the same when you go from short to long jobs,
or does it decrease?
That is a metric that should decrease with longer jobs, I suppose.
Otherwise you may suspect other problems (like the influence of
network topology, etc.).
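
For example (made-up numbers, just to illustrate the metric): if three
runs of the same job take 300 s, 320 s and 340 s, then
(340 - 300) / 320 = 0.125, i.e. a relative variation of about 12.5%.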

>
> I don't understand what you meant by "warm up",
> "startup/wrapup", etc. In my case, the "reading in" of the velocity
> field and pressure field can occasionally take hugely different times
> (37 seconds in one case, 3 seconds in another case).
>

If your program is IO-intensive the variation in performance may
be impacted by a busy NFS server or similar (assuming you are not doing 
MPI-IO).
Can you perhaps isolate the IO and measure only the total time minus IO?
Say, by doing all the IO in the beginning, and starting to measure time
after that?
Or do you do IO at every time step? [Not very efficient, but sometimes 
not avoidable either.]
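
If you want to see the split without restructuring the code too much, a
rough sketch of the bookkeeping could look like this (the variable names
and the loop bound nsteps are made up, just to illustrate):

double t0, io_time = 0.0, work_time = 0.0;

t0 = MPI_Wtime();
/* ... read in the velocity and pressure fields ... */
io_time += MPI_Wtime() - t0;

for (int step = 0; step < nsteps; step++) {
    t0 = MPI_Wtime();
    /* ... compute one time step ... */
    work_time += MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    /* ... write any per-step output ... */
    io_time += MPI_Wtime() - t0;
}

/* Compare io_time and work_time; only the latter should be used
   when comparing the -O2 and -O3 builds. */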

> I guess Tony's point makes sense, that the problem is in switches. But I
> am not sure.
>

Yes, network topology, switches, heterogeneous hardware in the nodes (if
that happens to be the case), defective nodes, and possible leftover
processes from previous jobs (that may not have been cleaned up)
could all contribute negatively.

Gus Correa

>
>
>
>
> Zhigang Wei
> ----------------------
> University of Notre Dame
>
>
> On Thu, Mar 27, 2014 at 7:58 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>     On 03/27/2014 05:58 PM, Daniel WEI wrote:
>
>
>         On Thu, Mar 27, 2014 at 5:45 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>
>              So your performance can vary depending on what else is
>         going on with
>              the other nodes in the system
>
>
>         Thank you Tony. I see.
>
>         (1) But how much variance?! My results show some very disturbing
>         differences: in one case, initializing the case takes 37s, in
>         another 5s, and in yet another 2s!!!
>         (2) How can I, or somebody else, do our best to reduce this
>         variance? (There are 16 cores/node, so there should be nobody
>         else using the nodes I was calling; this seems to be guaranteed.)
>         (3) My goal is to compare the Intel compiler's -O3 and -O2 options
>         for building my CFD code in terms of speed, but now if my
>         performance varies even for the same case on the same hosts, how
>         can I trust my results anymore?
>         Zhigang Wei
>         ----------------------
>         University of Notre Dame
>
>
>     Hi Zhigang
>
>     What time are you measuring?
>     Wall time from the job scheduler for the whole job?
>     Wall time for the application only (say with Unix time utility or
>     MPI_Wtime)?
>     Something else?
>
>     Have you tried to run your test simulations for a longer time
>     (several minutes, one hour perhaps, not just a few seconds)
>     to see if the outcome shows less spread?
>     Say, you could change the number of time steps to 100x
>     or perhaps 10,000x what you are currently using,
>     depending of course on the max walltime allowed by your cluster queue.
>
>     My wild guess is that with short-lived simulations
>     what may count most is the job or application
>     startup and wrapup times, which may vary significantly in a cluster,
>     especially in a big cluster, overwhelming and obscuring your program
>     execution time.
>     Most MPI and benchmark implementations recommend
>     that you "warm up" your own tests/benchmarks
>     for a time long enough to reduce such startup/wrapup effects.
>
>     My two cents,
>     Gus Correa
>
>
>
>     _______________________________________________
>     mvapich-discuss mailing list
>     mvapich-discuss at cse.ohio-state.edu
>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



