[mvapich-discuss] Job start-up time

Pavan Balaji balaji at mcs.anl.gov
Fri Jul 2 11:30:00 EDT 2010


One of the uses of the address information is to identify which 
processes are local and which are remote. So, until you get that from 
the process manager (using the current O(N**2) algorithm), you don't 
know which processes are local. Don't we love chicken-and-egg problems :-).
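
For concreteness, the PMI-1.0 pattern looks roughly like the sketch 
below (an illustration written against pmi.h, not the actual MVAPICH2 
code; the "card-<rank>" key naming and the card contents are invented). 
Each rank publishes one value but fetches N-1 of them, so the single 
KVS server ends up servicing O(N**2) gets:

/* Sketch of the PMI-1.0 all-to-all "business card" exchange
 * (illustration only; key names are made up). */
#include <stdio.h>
#include <pmi.h>

#define VALLEN 256

int main(void)
{
    int spawned, rank, size, i;
    char kvsname[256], key[64], card[VALLEN];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

    /* Publish my own "business card" (address/port placeholder). */
    snprintf(key, sizeof(key), "card-%d", rank);
    snprintf(card, sizeof(card), "ip:port-of-rank-%d", rank);
    PMI_KVS_Put(kvsname, key, card);
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();

    /* Fetch every other rank's card: N-1 gets per rank, so the
     * KVS server answers N*(N-1) requests overall. */
    for (i = 0; i < size; i++) {
        if (i == rank)
            continue;
        snprintf(key, sizeof(key), "card-%d", i);
        PMI_KVS_Get(kvsname, key, card, VALLEN);
    }

    PMI_Finalize();
    return 0;
}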

The root cause of this issue is that older versions of MVAPICH2 use 
PMI-1.0 internally. MVAPICH2-1.5.x is based on MPICH2-1.2.1p1, which 
has PMI-1.1 support; that should bring this exchange down to constant 
time (though I haven't checked the MVAPICH2-1.5rc2 code to confirm 
that it actually takes advantage of PMI-1.1).
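
The relevant PMI-1.1 addition is that the process manager publishes the 
whole process-to-node layout under the reserved key 
"PMI_process_mapping", so locality detection becomes one constant-size 
get instead of N address lookups. Roughly (illustration only, not the 
real MPICH2 code path):

/* Sketch of the PMI-1.1 shortcut: a single O(1) get of the
 * process-to-node mapping (illustration only). */
#include <stdio.h>
#include <pmi.h>

int main(void)
{
    int spawned, rank, size;
    char kvsname[256], mapping[1024];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

    /* One constant-size get; the value is a compact description
     * such as "(vector,(0,4,8))" (starting at node 0, 4 nodes, 8
     * processes per node), from which each rank can work out which
     * ranks share its node without fetching any addresses. */
    if (PMI_KVS_Get(kvsname, "PMI_process_mapping",
                    mapping, sizeof(mapping)) == PMI_SUCCESS)
        printf("rank %d of %d: mapping = %s\n", rank, size, mapping);

    PMI_Finalize();
    return 0;
}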

The process manager (PM) will need to support it too. The Hydra PM does, 
but from a quick grep it looks like the mpirun_rsh PM doesn't yet 
support it.
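
For what it's worth, the peer-assisted fallback Chris describes below 
could be prototyped along these lines (a purely hypothetical sketch: 
ask_peer_for_card() is a made-up placeholder, not anything that exists 
in MVAPICH2 today, and the key naming follows the invented scheme above):

/* Hypothetical peer-assisted lookup: ask an already-connected peer
 * first, and bother the root/KVS server only on a cache miss. */
#include <stdio.h>
#include <pmi.h>

#define VALLEN 256

/* Placeholder: ask an existing connection whether it has the target's
 * address cached; returns 0 on a miss.  Not a real MVAPICH2 function. */
static int ask_peer_for_card(int peer, int target, char *card, int len)
{
    (void)peer; (void)target; (void)card; (void)len;
    return 0;
}

static int lookup_card(const char *kvsname, int known_peer, int target,
                       char *card, int len)
{
    char key[64];

    if (ask_peer_for_card(known_peer, target, card, len))
        return 0;               /* fast path: root never involved */

    /* Slow path: only cache misses reach the root/KVS server. */
    snprintf(key, sizeof(key), "card-%d", target);
    return (PMI_KVS_Get(kvsname, key, card, len) == PMI_SUCCESS) ? 0 : -1;
}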

  -- Pavan (MPICH2 team)

On 07/02/2010 03:42 AM, TJC Ward wrote:
> For the various start methods I am aware of, MVAPICH2 job start appears 
> to take a time which varies as the square of the number of ranks. My 
> team is looking at the possibility of using MVAPICH2 for initiating 
> 'lightweight' jobs on 'large' clusters, such that an O(N**2) startup 
> would be inconvenient. Launch by 'slurm' is the most efficient of the 
> ways I have tried, but that still appears to have an O(N**2) component.
> 
> As far as I can see, the O(N**2) component arises because every node 
> asks for addressing information (IP address and port) about every 
> other node; these requests go across PMI on the requesting node, and 
> all of them get answered by the single 'root' node. This leaves the 
> 'root' node with O(N**2) operations to perform.
> 
> I think there might be a more scalable algorithm; once you make contact 
> with a node in the job, you could ask it 'can you tell me the IP address 
> and port of your neighbour?'; usually it would know this already, and be 
> able to reply without consulting the 'root' node. I think this would 
> involve the mvapich library asking its peer directly, and only asking 
> across PMI if the peer didn't know.
> 
> Is anyone else interested in a lighter-weight start? Is anyone working 
> on it?
> T J (Chris) Ward, IBM Research
> Scalable Data-Centric Computing - Active Storage Fabrics - IBM System BlueGene
> IBM United Kingdom Ltd., Hursley Park, Winchester, Hants, SO21 2JN
> 011-44-1962-818679
> IBM Intranet: http://hurgsa.ibm.com/~tjcw/
> IBM System BlueGene Research: http://www.research.ibm.com/bluegene/
> IBM System BlueGene Marketing: http://www-03.ibm.com/systems/deepcomputing/bluegene/

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

