[mvapich-discuss] Optimised MPI features

Balint Joo bjoo at jlab.org
Fri Apr 18 19:50:22 EDT 2008


Dear All,
   I have recently returned from a trip to a Cray workshop where
we discussed which features of MPI are optimized. As it turns out,
they like to have receives pre-posted, in which case messages are
transferred directly into the memory space of our application.
Also, very short messages (< 1 KB) go through a 'fast path' by being
included directly in the message headers.

Our applications use the 'MPI_Request' paradigm to pre-declare
persistent communications. We declare the requests with MPI_Send_init
and MPI_Recv_init and then start and finish them multiple times
with MPI_Startall & MPI_Waitall. As it turns out, MPI_Send_init
and MPI_Recv_init do not pre-post the communications; the
posting apparently only happens when we call MPI_Startall.
Thus, if processes are not tightly synchronized, we can end up in
the situation where one process posts its sends before others have
pre-posted their receives, even though the receives are always
declared before the sends. Rejigging the comms pattern to always
start the receives at the beginning of the calculation helps if the
local problem is big enough, but for the minimal problem (strong
scaling) the local compute is small enough that the MPI_Startall
calls are sufficiently unsynchronised that the receives do not get
pre-posted, even though the declarations of the receives always
precede the declarations of the sends.
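
For concreteness, here is a stripped-down sketch of the pattern
(neighbour ranks, buffer sizes and the direction count are
placeholders for what our code actually sets up):

    #include <mpi.h>

    #define NDIR 8                     /* e.g. +/- neighbours in 4 dimensions */

    /* Set up elsewhere: one send and one receive buffer per direction,
       and neigh[d] = rank of the neighbour in direction d.             */
    extern double *send_buf[NDIR], *recv_buf[NDIR];
    extern int     neigh[NDIR];

    MPI_Request reqs[2*NDIR];
    MPI_Status  stats[2*NDIR];

    void declare_comms(MPI_Comm comm, int count)
    {
        int d;
        for (d = 0; d < NDIR; d++) {
            /* Declared once at setup time. Sends are tagged with their
               direction of travel; the matching receive from neigh[d]
               therefore carries the opposite tag, d^1 (directions are
               paired +/-, so d^1 is the opposite of d).                */
            MPI_Recv_init(recv_buf[d], count, MPI_DOUBLE, neigh[d], d ^ 1,
                          comm, &reqs[d]);
            MPI_Send_init(send_buf[d], count, MPI_DOUBLE, neigh[d], d,
                          comm, &reqs[NDIR + d]);
        }
    }

    void do_comms(void)
    {
        /* Started and completed many times; our understanding is that
           the receives are only actually posted here, at MPI_Startall. */
        MPI_Startall(2*NDIR, reqs);
        MPI_Waitall(2*NDIR, reqs, stats);
    }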

I would like to ask the developers whether this approach of using
MPI_Startall and MPI_Waitall is well optimized in MVAPICH, or whether
we'd be better off using MPI_Isend / MPI_Irecv pairs.
As it happens, with MVAPICH 1.0.0 over InfiniBand we see poor scaling
on Ranger and on our own IB cluster beyond 128-256 cores, though
scaling seems near perfect up to 128 cores. Other features we use
which may be affecting us include:

- duplicating MPI_COMM_WORLD with MPI_Comm_dup to make
a new communicator, which we call QMP_COMM_WORLD, and then
using QMP_COMM_WORLD thereafter. On the Cray XT and IBM BG/P
it appears that MPI_COMM_WORLD is 'special' and the implementation
knows that all MPI tasks communicate. Is MPI_COMM_WORLD special
in MVAPICH?
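
  For reference, the duplication itself is essentially just:

    MPI_Comm QMP_COMM_WORLD;
    MPI_Comm_dup(MPI_COMM_WORLD, &QMP_COMM_WORLD);
    /* all subsequent communication goes through QMP_COMM_WORLD */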

- Using MPI_Cart_create to try to get a virtual topology,
assuming that it will generate a topology that is somehow
close to the machine. On InfiniBand (which is a switch or
tree of switches) I wonder whether this approach is efficient.
I have heard that on the Crays it is not optimal -- users
don't have access to the physical topology, and
MPI_Cart_create may return a topology unrelated to the machine
(even though it is a 3D mesh/torus underneath physically, if not
in terms of the current job).
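
  Schematically, we do something like the following (the number of
  dimensions and the grid sizes are just illustrative; the point is
  the reorder flag, which we set in the hope that the implementation
  maps ranks sensibly onto the machine):

    int dims[4]    = {2, 2, 2, 4};     /* illustrative process grid */
    int periods[4] = {1, 1, 1, 1};     /* periodic boundaries       */
    MPI_Comm cart_comm;

    /* reorder = 1: allow the implementation to permute ranks to
       match the machine -- does MVAPICH do anything useful with
       this on an InfiniBand switch hierarchy?                     */
    MPI_Cart_create(QMP_COMM_WORLD, 4, dims, periods, 1, &cart_comm);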

- Recent extensions to our comms interface are considering
the use of single-ended (one-sided) comms primitives from
MPI-2 (MPICH2). Would the developers care to express an opinion
on this issue?
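
  We have not settled on a synchronisation model, but a fence-based
  sketch of the kind of thing we mean would be (buffer names, n and
  neighbour_rank are placeholders for our halo setup):

    MPI_Win win;

    /* Expose one face's receive buffer to the other ranks ...       */
    MPI_Win_create(recv_buf, (MPI_Aint)(n * sizeof(double)),
                   sizeof(double), MPI_INFO_NULL, QMP_COMM_WORLD, &win);

    /* ... and deposit our halo directly into the neighbour's copy.  */
    MPI_Win_fence(0, win);
    MPI_Put(send_buf, n, MPI_DOUBLE, neighbour_rank,
            0, n, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);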

- Finally: our problem is a closely coupled, stencil-like
calculation. The reason for using the 'MPI_Request' paradigm
is to pre-declare (preferably pre-post) an asynchronous
communications pattern, essentially to communicate our halos.
We can then overlap the communication (initiated by MPI_Startall)
with computation, and on finishing the computation we complete
the communication with MPI_Waitall.
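
  Schematically, the intended overlap is (reqs, stats and NDIR as in
  the earlier sketch; compute_interior() and compute_faces() stand in
  for our own routines):

    MPI_Startall(2*NDIR, reqs);        /* kick off the halo exchange     */
    compute_interior();                /* work that does not need halos  */
    MPI_Waitall(2*NDIR, reqs, stats);  /* halos are now in place         */
    compute_faces();                   /* work that needed the halo data */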

I would very much appreciate views from the developers regarding
these issues. We'd like to raise our performance on our own
IB clusters as well as on Ranger, or at least improve our scaling
beyond 256 cores; on Ranger our target is at least 2048 cores.
We wonder whether our paradigm of using MPI_Requests
is in some way inhibiting our performance.
I look forward with thanks to any comments on these issues.

With my very best wishes,
  	Balint

-- 
-------------------------------------------------------------------
Dr Balint Joo              High Performance Computational Scientist
Jefferson Lab
12000 Jefferson Ave, Mail Stop 12B2, Room F217,
Newport News, VA 23606, USA
Tel: +1-757-269-5339,      Fax: +1-757-269-5427
email: bjoo at jlab.org       (old email: bj at ph.ed.ac.uk)
-------------------------------------------------------------------

