[mvapich-discuss] Same nodes different time?

Daniel WEI lakeat at gmail.com
Thu Mar 27 17:15:29 EDT 2014


Thank you Jonathan.

Let's set the node order aside for a moment. Say I use the same order of
hosts in my hostfile and run a test. With the same hostfile and everything
else unchanged, I ran the simulation twice today. The results are the same,
which is good, but the time spent (measured by wall clock time) is
different! I am completely lost. What might be the culprit?

The simulations were carried out at our school's high performance computing
center, which is managed by SGE. My hosts file is written as:

a01.aaa.bbb.edu:16
a02.aaa.bbb.edu:16
a03.aaa.bbb.edu:16
a04.aaa.bbb.edu:16
a05.aaa.bbb.edu:16
a06.aaa.bbb.edu:16
a07.aaa.bbb.edu:16
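
To make the rank-placement question concrete, here is a minimal Python sketch of how a "host:slots" hostfile could translate into a rank-to-node map. It assumes block placement, where each hostfile line receives its slot count as consecutive ranks in file order; that is an assumption for illustration, and the launcher's real mapping may differ. The function names parse_hostfile and block_rank_map are hypothetical, and the two host lists are the four-node orderings from my earlier message.

```python
# Sketch only: assumes block placement, i.e. each "host:slots" line
# receives `slots` consecutive ranks. A real launcher may map differently.

def parse_hostfile(text):
    """Return a list of (host, slots) pairs from 'host:slots' lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, _, slots = line.partition(":")
        # A line without ':slots' is treated as a single slot (assumption).
        entries.append((host, int(slots) if slots else 1))
    return entries


def block_rank_map(entries):
    """Assign consecutive rank numbers to each host, in file order."""
    mapping = {}
    rank = 0
    for host, slots in entries:
        for _ in range(slots):
            mapping[rank] = host
            rank += 1
    return mapping


# The two orderings asked about earlier in this thread.
hosts_1 = "node1:16\nnode2:16\nnode3:16\nnode4:16\n"
hosts_2 = "node2:16\nnode4:16\nnode3:16\nnode1:16\n"

map_1 = block_rank_map(parse_hostfile(hosts_1))
map_2 = block_rank_map(parse_hostfile(hosts_2))

# Ranks whose node differs between the two hostfile orders.
moved = [r for r in map_1 if map_1[r] != map_2[r]]
print(len(map_1), "ranks in total;", len(moved), "land on a different node")
```

Under this assumed mapping, reordering the hostfile changes which ranks share a node, so it could change communication cost. It does not explain a timing difference between two runs that use the identical hostfile, which is the case I am asking about here.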





Zhigang Wei
----------------------
*University of Notre Dame*


On Thu, Mar 27, 2014 at 3:49 PM, Jonathan Perkins <
perkinjo at cse.ohio-state.edu> wrote:

> Hi Daniel.  The order of hosts in your hostfile may affect which node
> each MPI rank runs on.  This, combined with your application's
> communication pattern, may result in different performance of the
> application.  I believe this is what you're seeing.
>
> When using mpirun without specifying the hostfile or hosts on the
> command line, your ranks will all launch on the localhost unless this
> is invoked under Slurm or another resource management environment.
>
> On Thu, Mar 27, 2014 at 3:10 PM, Daniel WEI <lakeat at gmail.com> wrote:
> > Dear List,
> >
> > Is it possible that:
> >
> > hosts (1):
> > node1:16
> > node2:16
> > node3:16
> > node4:16
> >
> > gives different results than
> >
> > hosts (2):
> > node2:16
> > node4:16
> > node3:16
> > node1:16
> >
> > I mean, does the order in which the nodes appear in the hostfile matter
> > for speed (suppose node1~node4 have exactly the same architecture)?
> >
> > If I use a simple mpirun -np XX blahblah, how does mvapich2 generate this
> > node sequence? Based on what? Which node is going to be the host node, and
> > which are going to be the slave nodes? Is it a random decision?
> >
> > My simulations show different wall clock times at the end of 100 time
> > steps. The residuals and everything else are the same; I have checked the
> > logs and the results. It is just the time taken for each time step that is
> > different. Is it a memory issue or an oversubscription issue? Has anyone
> > experienced this situation before?
> >
> >
> > Thanks a lot,
> >
> >
> >
> >
> > Zhigang Wei
> > ----------------------
> > University of Notre Dame
> >
> >
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>