[mvapich-discuss] truly one-sided communication

sreeram potluri potluri at cse.ohio-state.edu
Wed Jan 19 09:22:12 EST 2011


Dear Thiago,

Thanks for your email. Currently, MVAPICH2 supports truly one-sided
communication for the active modes of synchronization (Fence and
Post-Wait/Start-Complete) available in MPI-2 RMA. The truly one-sided
designs for passive mode are in the pipeline for the next (1.7) release of
MVAPICH2.
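
For reference, an access epoch using fence synchronization looks roughly like
the minimal sketch below (it reuses the window and buffer names from your
attached test program and is only an outline, not code from the MVAPICH2
sources). Note that fence synchronization is collective over the window's
communicator, so the target process also has to reach the fence calls:

MPI_Win_fence(0, win);                      /* all ranks open the epoch */
if (rank != 0)                              /* origins read rank 0's exposed buffer */
  MPI_Get(receiveBuffer, BUFFER_SIZE, MPI_CHAR,
          0, 0, BUFFER_SIZE, MPI_CHAR, win);
MPI_Win_fence(0, win);                      /* close the epoch; receiveBuffer is now valid */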

Thank you
Sreeram Potluri

On Wed, Jan 19, 2011 at 2:41 AM, Thiago Ize <thiago at sci.utah.edu> wrote:

>  Hi,
> I was under the impression that mvapich2 used truly one-sided communication
> when doing a passive MPI_Get.  At least that was my understanding from
> reading "Efficient Implementation of MPI-2 Passive One-Sided Communication
> on InfiniBand Clusters."  Does mvapich2 still use a thread-based design, or
> does it use atomics? Is there some way to get mvapich2 to not depend on the
> target process being able to respond to a passive MPI_Get?
>
> Here's my problem.  I need to do very low latency MPI_Gets very frequently
> from lots of nodes to lots of other nodes over InfiniBand, but occasionally
> some of the target processes from which I need to remotely read memory
> will have extensive computation to do or will be sleeping in a
> mutex/barrier, so they will not make any MPI calls and will not be able to
> advance the MPI progress engine.  When this occurs, performance plummets
> across all the nodes (a 10 us MPI_Get becomes 10 s).
> My program requires multi-threading support, but I can reproduce this issue
> with serial code as well and have attached an example program that shows
> this.  The program has each process perform an MPI_Get on rank 0's data.  I
> added some superfluous computation that is performed only by rank 0. Here's
> the output I get:
> $ mpirun_rsh -np 2 node1 node2 one_sided
> performing 1 non-mpi work for rank 0
> rank 0 did 79.041 Gb/s in 0.051 seconds
> rank 1 did 8.667 Gb/s in 0.462 seconds
> performing 10 non-mpi work for rank 0
> rank 0 did 2.641 Gb/s in 1.515 seconds
> rank 1 did 2.077 Gb/s in 1.925 seconds
> performing 100 non-mpi work for rank 0
> rank 0 did 1.336 Gb/s in 2.993 seconds
> rank 1 did 1.178 Gb/s in 3.397 seconds
> performing 1000 non-mpi work for rank 0
> rank 0 did 0.867 Gb/s in 4.611 seconds
> rank 1 did 0.802 Gb/s in 4.985 seconds
> performing 10000 non-mpi work for rank 0
> rank 0 did 0.522 Gb/s in 7.657 seconds
> rank 1 did 0.499 Gb/s in 8.016 seconds
>
> What I would have expected is that rank 0 gets slower since it becomes
> compute-bound, but rank 1 should stay just as fast.  Instead, they both get
> slower together.
>
> Here's what I used:
> $mpiname -a
> MVAPICH2 1.6rc2 2010-12-23 ch3:mrail
>
> Compilation
> CC: gcc  -DNDEBUG -O2
> CXX: c++  -DNDEBUG -O2
> F77: gfortran  -DNDEBUG
> F90:   -DNDEBUG
>
> Configuration
> --prefix=/home/sci/thiago/apps/linux64/mvapich2-1.6-rc2-ch3
> --enable-sharedlibs=gcc --enable-threads --enable-languages=c,c++
> --disable-f77 --disable-f90 --enable-error-checking=no --enable-fast=all
>
> Is there anything I could do to get this to scale?  Did I do something
> wrong?
>
> Thanks,
> Thiago
>
> #include <mpi.h>
> #include <iostream>
> #include <stdio.h>
> #include <cmath>
> #include <unistd.h>  // for sleep()
> using namespace std;
>
> int main(int argc, char* argv[]) {
>
>  int provided=-1;
>  int requested = MPI_THREAD_SINGLE;//MPI_THREAD_MULTIPLE;
>  MPI_Init_thread(&argc, &argv, requested, &provided);
>
>  const int rank = MPI::COMM_WORLD.Get_rank();
>
>  const unsigned long NUM_GETS = 10000; // something big
>  const unsigned long BUFFER_SIZE = 2048*32;
>  char* sendBuffer = new char[BUFFER_SIZE];
>
>  MPI::Win win = MPI::Win::Create(sendBuffer, BUFFER_SIZE, 1, MPI_INFO_NULL,
> MPI_COMM_WORLD);
>
>
>  double startTime = MPI::Wtime();
>
>  char* receiveBuffer = new char[BUFFER_SIZE];
>
>  for (int j=0; j < 5; ++j) {
>    const int repeat = (int)pow(10.0,j);
>    if (rank == 0)
>      printf("performing %d non-mpi work for rank 0\n", repeat);
>    for (unsigned long i = 0; i < NUM_GETS; ++i) {
>
>      if (rank == 0) {
>        for (int i=0; i < repeat; ++i)
>          if (sqrt(float(i))== -3)
>            printf("did some computation that will never be used.\n");
>      }
>
>      int owner = 0;
>      int err = MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
>      if (err != MPI_SUCCESS) cout << "ERROR lock: " <<err<<endl;
>
>      err = MPI_Get(receiveBuffer, BUFFER_SIZE, MPI_CHAR,
>                    owner, 0, BUFFER_SIZE, MPI_CHAR, win);
>      if (err != MPI_SUCCESS) cout << "ERROR get: " <<err<<endl;
>
>      err = MPI_Win_unlock(owner, win);
>      if (err != MPI_SUCCESS) cout << "ERROR unlock: " <<err<<endl;
>    }
>
>    double endTime = MPI::Wtime();
>    double sec = (endTime-startTime);
>    unsigned long bitsGot = 1*8*BUFFER_SIZE*NUM_GETS;
>    float GbitsGot = bitsGot/1073741824;
>    printf("rank %d did %.3lf Gb/s in %.3lf seconds \n",rank,
> (GbitsGot/sec), sec);
>
>  MPI_Barrier(MPI_COMM_WORLD);
>  sleep(1); // so that the output is in sync and looks pretty
>  }
>  win.Free();  // MPI_Win_free() expects an MPI_Win*, so free the C++ window object directly
>
>  MPI::Finalize();
>  return 0;
> }
>
