[mvapich-discuss] mvapich & mpiexec

Pete Wyckoff pw at osc.edu
Tue May 29 15:24:21 EDT 2007


koop at cse.ohio-state.edu wrote on Tue, 29 May 2007 10:42 -0400:
[pw said:]
> > Before I check it in, can you provide some commentary on the reason
> > for the two sets of accept(), read(), write() for every task?  On
> > the surface this would seem like a major barrier to scalability.
> > There must be a good reason to get the hostids out first, then have
> > the tasks come back for the addresses.
> 
> The reason here is to allow extended functionality.  Information
> about process/host pairs is exchanged prior to QP setup so that the
> QPs can be set up in efficient ways.  In particular, if multiple
> HCAs are available, the QPs should use them all, and knowing which
> ranks share a node allows better use of multiple paths through the
> network -- all of this is required before QP setup.
> 
> Scalability has not been a problem; this scheme has been run at
> multi-thousand-process scale.
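
To make the multi-HCA point concrete: once a process knows which
ranks share its host, it can stripe its QPs across the available
adapters before any of them are created.  A toy sketch of that kind
of decision, in C -- names and policy invented here for illustration,
not mvapich's actual code:

    #include <stdio.h>

    /* Stripe the ranks on one host across its adapters. */
    int pick_hca(int local_rank, int num_hcas)
    {
        return local_rank % num_hcas;
    }

    int main(void)
    {
        int num_hcas = 2;                   /* adapters on this host   */
        int local_ranks[] = { 0, 1, 2, 3 }; /* ranks sharing this host */
        int i;

        for (i = 0; i < 4; i++)
            printf("local rank %d -> HCA %d\n",
                   local_ranks[i], pick_hca(local_ranks[i], num_hcas));
        return 0;
    }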

Okay, I'll add something about multi-NIC support, for the curious.
But you haven't really explained why you need to disconnect the TCP
socket after exchanging hostids, then have everybody reconnect
again.  Perhaps you are changing the ranks of processes on a
particular node once they receive the hostid information?  Otherwise
it would seem just as easy to send all the information across the
same socket:  hostids, addresses, and pids.
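
Roughly what I have in mind, sketched from the launcher side -- the
struct layout and names below are invented for the example, not what
mvapich or mpiexec actually put on the wire:

    #include <stdint.h>
    #include <unistd.h>

    struct task_info {
        uint32_t hostid;   /* which host this rank lives on       */
        uint32_t pid;      /* process id, for signal delivery     */
        uint32_t addr;     /* QP/LID info, exact format not shown */
    };

    /* One accept()ed socket per task, reused for both phases:
     * gather each task's info, then scatter the full table back.
     * No disconnect/reconnect in between. */
    int serve_task(int sock, struct task_info *table, int ntasks, int rank)
    {
        if (read(sock, &table[rank], sizeof(table[rank]))
            != (ssize_t) sizeof(table[rank]))
            return -1;
        /* ... wait here until all ntasks entries have arrived ... */
        if (write(sock, table, ntasks * sizeof(*table))
            != (ssize_t) (ntasks * sizeof(*table)))
            return -1;
        return 0;
    }

Whether that fits the internal structure of the mvapich startup code
I can't say, but one connection per task is enough to carry all three
pieces of information.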

Experiments with mpiexec and mvapich1 on the 8000-processor Sandia
cluster showed that PBS executable startup was the slowest part of
job launch, followed closely by the all-to-one communication
required during job startup.  Each of the 8000 nodes does a full
TCP three-way handshake (SYN, SYN-ACK, ACK) with the single mpiexec
(or mpirun_rsh) process, all at roughly the same time, and Ethernet
packet losses cause significant delays.  Adding another teardown
(FIN, ACK, FIN, ACK) and reconnect (SYN, SYN-ACK, ACK) to the
startup will make things even slower, especially since the processes
have become fully synchronized after the first round, exacerbating
the congestion problem.
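
For a sense of what tools end up doing at this scale, here is the
shape of a client-side connect with backoff -- again only a sketch,
not code from mpiexec or mvapich:

    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Try to reach the launcher; if the SYN is lost in the storm,
     * back off and retry instead of hammering the same instant. */
    int connect_with_backoff(const char *ip, int port, int max_tries)
    {
        int attempt;

        for (attempt = 0; attempt < max_tries; attempt++) {
            int s = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in sa;

            if (s < 0)
                return -1;
            memset(&sa, 0, sizeof(sa));
            sa.sin_family = AF_INET;
            sa.sin_port = htons(port);
            inet_pton(AF_INET, ip, &sa.sin_addr);
            if (connect(s, (struct sockaddr *) &sa, sizeof(sa)) == 0)
                return s;               /* connected */
            close(s);
            sleep(1U << attempt);       /* back off: 1s, 2s, 4s, ... */
        }
        return -1;                      /* give up */
    }

Each extra handshake round per task just widens the window in which
that kind of retry storm can happen.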

Just some words to point out how you should expect this change to be
visible to high-end users.  For small clusters this effect should
not be noticeable.  But most importantly, things do work again.

		-- Pete
