[mvapich-discuss] truly one-sided communication

Thiago Ize thiago at sci.utah.edu
Thu Jul 28 19:48:41 EDT 2011


Hi,

I just tried 1.7 rc1 and my application still seems to have the same 
issue.  Was truly one-sided communication for passive mode actually 
introduced in 1.7, and if so, do I need to do anything special to make 
use of it?

Thanks,
Thiago

sreeram potluri wrote:
> Dear Thiago,
>
> Thanks for your email. Currently, MVAPICH2 supports truly one-sided 
> communication for the active modes of synchronization (Fence and 
> Post-Wait/Start-Complete) available in MPI-2 RMA. The truly one-sided 
> designs for passive mode are in the pipeline for the next (1.7) 
> release of MVAPICH2. 
>
> Thank you
> Sreeram Potluri
>
> On Wed, Jan 19, 2011 at 2:41 AM, Thiago Ize <thiago at sci.utah.edu 
> <mailto:thiago at sci.utah.edu>> wrote:
>
>     Hi,
>     I was under the impression that MVAPICH2 used truly one-sided
>     communication when doing a passive MPI_Get.  At least that is my
>     understanding from reading "Efficient Implementation of MPI-2
>     Passive One-Sided Communication on InfiniBand Clusters."  Does
>     MVAPICH2 still use a thread-based design, or does it now use
>     atomics?  Is there some way to keep MVAPICH2 from depending on the
>     target process being able to respond to a passive MPI_Get?
>
>     Here's my problem.  I need to do very low-latency MPI_Gets very
>     frequently from lots of nodes to lots of other nodes over
>     InfiniBand, but occasionally some of the target processes from
>     which I need to remotely read memory will have extensive
>     computation to do or will be sleeping in a mutex/barrier, so they
>     will not make any MPI calls and will not be able to progress the
>     MPI engine.  When this occurs, performance plummets across all the
>     nodes (a 10us MPI_Get becomes 10s). 
>     My program requires multi-threading support, but I can reproduce
>     this issue with serial code as well and have attached an example
>     program that shows this.  The program has each process perform an
>     MPI_Get on rank 0's data.  I added some superfluous computation
>     that is performed only by rank 0. Here's the output I get:
>     $ mpirun_rsh -np 2 node1 node2 one_sided
>     performing 1 non-mpi work for rank 0
>     rank 0 did 79.041 Gb/s in 0.051 seconds
>     rank 1 did 8.667 Gb/s in 0.462 seconds
>     performing 10 non-mpi work for rank 0
>     rank 0 did 2.641 Gb/s in 1.515 seconds
>     rank 1 did 2.077 Gb/s in 1.925 seconds
>     performing 100 non-mpi work for rank 0
>     rank 0 did 1.336 Gb/s in 2.993 seconds
>     rank 1 did 1.178 Gb/s in 3.397 seconds
>     performing 1000 non-mpi work for rank 0
>     rank 0 did 0.867 Gb/s in 4.611 seconds
>     rank 1 did 0.802 Gb/s in 4.985 seconds
>     performing 10000 non-mpi work for rank 0
>     rank 0 did 0.522 Gb/s in 7.657 seconds
>     rank 1 did 0.499 Gb/s in 8.016 seconds
>
>     What I would have expected is that rank 0 gets slower as it
>     becomes compute bound while rank 1 stays just as fast.
>     Instead, they both get slower together.
>
>     Here's what I used:
>     $mpiname -a
>     MVAPICH2 1.6rc2 2010-12-23 ch3:mrail
>
>     Compilation
>     CC: gcc  -DNDEBUG -O2
>     CXX: c++  -DNDEBUG -O2
>     F77: gfortran  -DNDEBUG
>     F90:   -DNDEBUG
>
>     Configuration
>     --prefix=/home/sci/thiago/apps/linux64/mvapich2-1.6-rc2-ch3
>     --enable-sharedlibs=gcc --enable-threads --enable-languages=c,c++
>     --disable-f77 --disable-f90 --enable-error-checking=no
>     --enable-fast=all
>
>     Is there anything I could do to get this to scale?  Did I do
>     something wrong?
>
>     Thanks,
>     Thiago
>
>     #include <mpi.h>
>     #include <iostream>
>     #include <stdio.h>
>     #include <cmath>
>     #include <unistd.h> // for sleep()
>     using namespace std;
>
>     int main(int argc, char* argv[]) {
>
>      int provided=-1;
>      int requested = MPI_THREAD_SINGLE; // or MPI_THREAD_MULTIPLE
>      MPI_Init_thread(&argc, &argv, requested, &provided);
>
>      const int rank = MPI::COMM_WORLD.Get_rank();
>
>      const unsigned long NUM_GETS = 10000; // something big
>      const unsigned long BUFFER_SIZE = 2048*32;
>      char* sendBuffer = new char[BUFFER_SIZE];
>
>      MPI::Win win = MPI::Win::Create(sendBuffer, BUFFER_SIZE, 1,
>     MPI_INFO_NULL, MPI_COMM_WORLD);
>
>
>      double startTime = MPI::Wtime();
>
>      char* receiveBuffer = new char[BUFFER_SIZE];
>
>      for (int j=0; j < 5; ++j) {
>        const int repeat = (int)pow(10.0,j);
>        if (rank == 0)
>          printf("performing %d non-mpi work for rank 0\n", repeat);
>        for (unsigned long i = 0; i < NUM_GETS; ++i) {
>
>          if (rank == 0) {
>            // busy-work loop; k avoids shadowing the outer loop's i
>            for (int k=0; k < repeat; ++k)
>              if (sqrt(float(k)) == -3)
>                printf("did some computation that will never be used.\n");
>          }
>
>          int owner = 0;
>          int err = MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
>          if (err != MPI_SUCCESS) cout << "ERROR lock: " <<err<<endl;
>
>          err = MPI_Get(receiveBuffer, BUFFER_SIZE, MPI_CHAR,
>                        owner, 0, BUFFER_SIZE, MPI_CHAR, win);
>          if (err != MPI_SUCCESS) cout << "ERROR get: " <<err<<endl;
>
>          err = MPI_Win_unlock(owner, win);
>          if (err != MPI_SUCCESS) cout << "ERROR unlock: " <<err<<endl;
>        }
>
>        // note: startTime is never reset, so the reported times
>        // accumulate across the outer iterations
>        double endTime = MPI::Wtime();
>        double sec = (endTime-startTime);
>        unsigned long bitsGot = 1*8*BUFFER_SIZE*NUM_GETS;
>        float GbitsGot = bitsGot/1073741824.0f; // floating-point divide avoids truncation
>        printf("rank %d did %.3lf Gb/s in %.3lf seconds \n",rank,
>     (GbitsGot/sec), sec);
>
>      MPI_Barrier(MPI_COMM_WORLD);
>      sleep(1); // so that the output is in sync and looks pretty
>      }
>      win.Free(); // MPI_Win_free takes MPI_Win*, so use the C++ binding's Free()
>
>      MPI::Finalize();
>      return 0;
>     }
>
>     _______________________________________________
>     mvapich-discuss mailing list
>     mvapich-discuss at cse.ohio-state.edu
>     <mailto:mvapich-discuss at cse.ohio-state.edu>
>     http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
