[mvapich-discuss] Optimization flags for large messages and asynchronous progress

Amirreza Rastegari amirreza at umich.edu
Fri Jan 4 01:44:04 EST 2013


Hi,

I'm a user of sdsc trestles (a cluster of SMP nodes, AMD Magny cores, 32
core per node, 8 cores per socket, system specs are listed here:
http://www.sdsc.edu/us/resources/trestles/) . I'm having some performance
issues with a  code that is used for turbulent channel flow simulations.
After analysing the performance I have found the bottleneck is in the
communications. In the code I have tried to overlap the communications with
the computations, however it seems that the asynchronous progress is not
supported with the mpi library. The default mpi implementation on trestles
is mvapich2/1.5.1p1 (which I have used it for the tests) however
mvapich2/1.7 also is accessible. Is it possible for you to help me? is
there any flag that I'm missing? can you please help me to select the
optimized flags for my application?

Currently the code uses persistent communication in ready mode
(MPI_Rsend_init), and the bulk of the computation is done between
MPI_Startall and MPI_Waitall. The code looks like this:

#include <mpi.h>

int main(int argc, char *argv[]){
  /* definitions and initializations */

  MPI_Init(&argc, &argv);

  /* set up persistent channels to the neighbouring ranks */
  MPI_Rsend_init( /* channel to the rank before */ );
  MPI_Rsend_init( /* channel to the rank after */ );
  MPI_Recv_init( /* channel to the rank before */ );
  MPI_Recv_init( /* channel to the rank after */ );

  for (timestep=0; timestep<Time; timestep++)
  {
    /* prepare data for send */
    MPI_Startall( /* all four requests */ );

    /* do computations */

    MPI_Waitall( /* all four requests */ );

    /* do work on the received data */
  }

  MPI_Finalize();
  return 0;
}

Unfortunately the actual data transfer does not start until the
computations are done, and I don't understand why. The network is QDR
InfiniBand; each message is 23 MB (46 MB in total per rank per timestep,
one 23 MB message to the next rank and one to the previous), and the ranks
form a ring (i.e. rank N communicates with rank 1 and rank N-1). I need to
increase the size of the problem to extend my studies, which means
messages as large as 92 MB.
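
To illustrate what I mean by the transfer not starting, here is a rough
timing sketch (not my actual code; requests, the request count, and
compute() are placeholders for what the real code does inside the timestep
loop) of how the three phases could be measured separately:

  double t0 = MPI_Wtime();
  MPI_Startall(4, requests);        /* start the persistent sends/recvs */
  double t1 = MPI_Wtime();

  compute();                        /* placeholder for the bulk computation */
  double t2 = MPI_Wtime();

  MPI_Waitall(4, requests, MPI_STATUSES_IGNORE);
  double t3 = MPI_Wtime();

  printf("start %.3fs  compute %.3fs  wait %.3fs\n",
         t1 - t0, t2 - t1, t3 - t2);

If the wait phase takes about as long as the raw transfers would on their
own, nothing is overlapping with the computation.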

Currently I use the following flags:
VIADEV_RNDV_PROTOCOL=ASYNC
MV2_SMP_EAGERSIZE=46M
MV2_CPU_BINDING_LEVEL=socket
MV2_CPU_BINDING_POLICY=bunch
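
As a sanity check, a minimal standalone program like the sketch below (not
part of my application) can print these variables from inside the MPI
processes, to confirm that the launcher actually propagates them:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]){
  int rank, i;
  const char *names[] = { "VIADEV_RNDV_PROTOCOL", "MV2_SMP_EAGERSIZE",
                          "MV2_CPU_BINDING_LEVEL", "MV2_CPU_BINDING_POLICY" };
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    for (i = 0; i < 4; i++) {
      const char *v = getenv(names[i]);   /* value seen by the MPI process */
      printf("%s=%s\n", names[i], v ? v : "(unset)");
    }
  }
  MPI_Finalize();
  return 0;
}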

Can you please help me select the flags? Is there a set of optimal flags
for such applications? Also, should I use persistent communication, or
should I use MPI_Isend with a regular MPI_Recv, MPI_Send with MPI_Irecv,
or MPI_Isend with MPI_Irecv? Is there a specific combination for which
asynchronous progress works? Do you recommend I create a thread that just
repeatedly calls MPI_Test on each core?
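
For reference, the MPI_Isend/MPI_Irecv variant I have in mind looks
roughly like this (a sketch only: prev, next, the buffers, count, nchunks
and compute_chunk are placeholders), with MPI_Testall called between
chunks of computation to give the library a chance to make progress:

  MPI_Request reqs[4];
  int flag, chunk;

  /* each timestep: post the receives first, then the sends */
  MPI_Irecv(recv_prev, count, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(recv_next, count, MPI_DOUBLE, next, 1, MPI_COMM_WORLD, &reqs[1]);
  MPI_Isend(send_prev, count, MPI_DOUBLE, prev, 1, MPI_COMM_WORLD, &reqs[2]);
  MPI_Isend(send_next, count, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[3]);

  for (chunk = 0; chunk < nchunks; chunk++) {
    compute_chunk(chunk);               /* a slice of the bulk computation */
    MPI_Testall(4, reqs, &flag, MPI_STATUSES_IGNORE);  /* poke the progress engine */
  }
  MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

The idea is that calling MPI_Testall between chunks would let the
rendezvous protocol advance without a separate progress thread, but I
don't know whether that is the recommended approach here.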

Thank you very much,
Amirreza

