[mvapich-discuss] MVAPICH 2 Progress Code improvement for RDMA_FAST_PATH

Fri Mar 23 10:27:04 EDT 2007

Hi all,

[ADAPTIVE_]RDMA_FAST_PATH is an optimization that provides low latency in 
mvapich2. The issue is that latency increases as the total number of 
processes grows. Eventually, when you launch a job with more than 32 
processes, latency is worse than with the standard send/recv protocol.

The reason is simple. Unlike the send/recv protocol, which gets its 
receives from a single completion queue, the RDMA fast path has to poll 
_every_ RDMA queue to find out which one has data to receive.
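
To make the cost concrete, here is a rough sketch of what "poll every 
queue" amounts to. This is not the actual MVAPICH2 code: struct rdma_vc, 
tail_flag and poll_all_rdma are names made up for illustration, and the 
flag simply stands in for whatever marks an RDMA buffer as filled.

```c
#include <stddef.h>

/* Hypothetical per-connection state: one flag per peer's RDMA buffer. */
struct rdma_vc {
    volatile int tail_flag;  /* set once the peer's RDMA write has landed */
};

/* Scan every VC in order; the work grows linearly with the number of peers.
 * Returns the index of the first VC with data, or -1 if none has any.
 * *checked reports how many VCs had to be inspected. */
int poll_all_rdma(struct rdma_vc *vcs, int nvcs, int *checked)
{
    *checked = 0;
    for (int i = 0; i < nvcs; i++) {
        (*checked)++;
        if (vcs[i].tail_flag)
            return i;
    }
    return -1;
}
```

In this model, if the message arrives on the last of 64 VCs, 64 flags are 
inspected before it is found, whereas a single completion queue hands 
back the next completion regardless of the number of peers.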

My first attempt to improve this was to poll only the VCs associated with 
the requests passed to MPID_Progress. That didn't work well because, 
unfortunately, well-written MPI applications are scarce, and calling 
MPI_Wait on the wrong request resulted in a deadlock.

My second attempt is a lot better. The RDMA polling set is now restricted to:
  * VCs on which we have posted receives waiting;
  * VCs on which we have a rendezvous send in progress.
... and it seems to work well and quickly, since polling is almost always 
directed at the right VC.
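
The bookkeeping behind such a polling set could look roughly like the 
sketch below. It is only an illustration under my own assumptions, not 
the patch itself: struct pollset, MAX_VCS, pollset_ref, pollset_unref and 
pollset_poll are hypothetical names, VCs are identified by a plain 
integer, and has_data[] stands in for the real RDMA buffer flags.

```c
#include <stddef.h>

#define MAX_VCS 1024  /* hypothetical upper bound for this sketch */

/* Hypothetical polling set; these are not MVAPICH2 structures. */
struct pollset {
    int refs[MAX_VCS];    /* posted receives + rendezvous sends per VC */
    int pos[MAX_VCS];     /* index of the VC in active[], or -1 */
    int active[MAX_VCS];  /* compact list of the VCs worth polling */
    int nactive;
};

void pollset_init(struct pollset *ps)
{
    ps->nactive = 0;
    for (int i = 0; i < MAX_VCS; i++) {
        ps->refs[i] = 0;
        ps->pos[i] = -1;
    }
}

/* Call when a receive is posted on vc or a rendezvous send starts. */
void pollset_ref(struct pollset *ps, int vc)
{
    if (ps->refs[vc]++ == 0) {
        ps->pos[vc] = ps->nactive;
        ps->active[ps->nactive++] = vc;
    }
}

/* Call when that receive is matched or the rendezvous send completes. */
void pollset_unref(struct pollset *ps, int vc)
{
    if (ps->refs[vc] == 0)
        return;  /* unbalanced call: nothing to drop */
    if (--ps->refs[vc] == 0) {
        /* Swap-remove: move the last active VC into the freed slot. */
        int hole = ps->pos[vc];
        int last = ps->active[--ps->nactive];
        ps->active[hole] = last;
        ps->pos[last] = hole;
        ps->pos[vc] = -1;
    }
}

/* Poll only the active VCs. Returns the first ready VC or -1, and
 * reports in *checked how many VCs were inspected. */
int pollset_poll(const struct pollset *ps, const int *has_data, int *checked)
{
    *checked = 0;
    for (int i = 0; i < ps->nactive; i++) {
        (*checked)++;
        if (has_data[ps->active[i]])
            return ps->active[i];
    }
    return -1;
}
```

The per-VC counter keeps a VC in the set as long as at least one posted 
receive or rendezvous send still refers to it, and the swap-remove keeps 
both insertion and removal constant-time, so the polling cost depends on 
the number of VCs with pending work rather than on the job size.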

Does anyone already have a good (or better) solution for this? Am I totally 
mistaken in my understanding of the MVAPICH 2 code? If not, I will 
consider cleaning things up and proposing a patch against 0.9.8, unless I 
should wait until 0.9.9?

Thanks in advance for your opinions/comments/flames on that,

Sylvain