[mvapich-discuss] truly one-sided communication

sreeram potluri potluri at cse.ohio-state.edu
Thu Jul 28 22:08:01 EDT 2011


Dear Thiago,

Thanks for your email. The truly one-sided passive mode has not been added
in 1.7rc1. It has been deferred to a future release. I will let you know
when this takes shape.

Sreeram Potluri

On Thu, Jul 28, 2011 at 7:48 PM, Thiago Ize <thiago at sci.utah.edu> wrote:

> Hi,
>
> I just tried 1.7rc1 and my application still seems to have the same
> issue.  Was truly one-sided communication for passive mode actually
> introduced in 1.7, and if so, do I need to do anything special to make
> use of it?
>
> Thanks,
> Thiago
>
> sreeram potluri wrote:
>
> Dear Thiago,
>
>  Thanks for your email. Currently, MVAPICH2 supports truly one-sided
> communication for the active modes of synchronization (Fence and
> Post-Wait/Start-Complete) available in MPI-2 RMA. The truly one-sided
> designs for passive mode are in the pipeline for the next (1.7) release of
> MVAPICH2.
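>
> To make the distinction concrete, the fence (active) mode mentioned above
> can be used roughly as in the sketch below. This is only an illustration (a
> minimal, hypothetical example, not code from MVAPICH2 or from this thread);
> the key point is that every rank in the window's communicator takes part in
> the synchronization, unlike the passive lock/unlock mode:
>
> #include <mpi.h>
> #include <cstdio>
>
> int main(int argc, char* argv[]) {
>   MPI_Init(&argc, &argv);
>   int rank;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>   const int N = 1024;
>   char* local  = new char[N];   // memory this rank exposes in the window
>   char* remote = new char[N];   // destination buffer for the Get
>   MPI_Win win;
>   MPI_Win_create(local, N, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>   MPI_Win_fence(0, win);        // collective: every rank opens the epoch
>   if (rank != 0)
>     MPI_Get(remote, N, MPI_CHAR, 0, 0, N, MPI_CHAR, win);
>   MPI_Win_fence(0, win);        // collective: every rank closes the epoch,
>                                 // after which the Gets are complete
>   if (rank != 0)
>     printf("rank %d fetched %d bytes from rank 0\n", rank, N);
>
>   MPI_Win_free(&win);
>   delete[] remote;
>   delete[] local;
>   MPI_Finalize();
>   return 0;
> }
>
> In the passive mode being asked about, only the origin side calls
> MPI_Win_lock/MPI_Get/MPI_Win_unlock, as in the program attached further
> down in this thread.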
>
>  Thank you
> Sreeram Potluri
>
> On Wed, Jan 19, 2011 at 2:41 AM, Thiago Ize <thiago at sci.utah.edu> wrote:
>
>>  Hi,
>> I was under the impression that mvapich2 used truly one-sided
>> communication when doing a passive MPI_Get.  At least this is my
>> understanding from reading "Efficient Implementation of MPI-2 Passive
>> One-Sided Communication on InfiniBand Clusters."  Does mvapich2 still use a
>> thread-based design or is it using the atomics? Is there some way to get
>> mvapich2 to not depend on the target process being able to respond to a
>> passive MPI_Get?
>>
>> Here's my problem.  I need to do very low-latency MPI_Gets very frequently
>> from lots of nodes to lots of other nodes over InfiniBand, but occasionally
>> some of the target processes from which I need to read memory remotely
>> will have extensive computation to do or will be sleeping in a
>> mutex/barrier, so they will not make any MPI calls and will not be able to
>> drive the MPI progress engine.  When this occurs, performance plummets
>> across all the nodes (a 10us MPI_Get becomes 10s).
>> My program requires multi-threading support, but I can reproduce this
>> issue with serial code as well and have attached an example program that
>> shows this.  The program has each process perform an MPI_Get on rank 0's
>> data.  I added some superfluous computation that is performed only by rank
>> 0. Here's the output I get:
>> $ mpirun_rsh -np 2 node1 node2 one_sided
>> performing 1 non-mpi work for rank 0
>> rank 0 did 79.041 Gb/s in 0.051 seconds
>> rank 1 did 8.667 Gb/s in 0.462 seconds
>> performing 10 non-mpi work for rank 0
>> rank 0 did 2.641 Gb/s in 1.515 seconds
>> rank 1 did 2.077 Gb/s in 1.925 seconds
>> performing 100 non-mpi work for rank 0
>> rank 0 did 1.336 Gb/s in 2.993 seconds
>> rank 1 did 1.178 Gb/s in 3.397 seconds
>> performing 1000 non-mpi work for rank 0
>> rank 0 did 0.867 Gb/s in 4.611 seconds
>> rank 1 did 0.802 Gb/s in 4.985 seconds
>> performing 10000 non-mpi work for rank 0
>> rank 0 did 0.522 Gb/s in 7.657 seconds
>> rank 1 did 0.499 Gb/s in 8.016 seconds
>>
>> What I would have expected is that rank 0 gets slower, since it becomes
>> compute-bound, while rank 1 stays just as fast.  Instead, they both get
>> slower together.
>>
>> Here's what I used:
>> $mpiname -a
>> MVAPICH2 1.6rc2 2010-12-23 ch3:mrail
>>
>> Compilation
>> CC: gcc  -DNDEBUG -O2
>> CXX: c++  -DNDEBUG -O2
>> F77: gfortran  -DNDEBUG
>> F90:   -DNDEBUG
>>
>> Configuration
>> --prefix=/home/sci/thiago/apps/linux64/mvapich2-1.6-rc2-ch3
>> --enable-sharedlibs=gcc --enable-threads --enable-languages=c,c++
>> --disable-f77 --disable-f90 --enable-error-checking=no --enable-fast=all
>>
>> Is there anything I could do to get this to scale?  Did I do something
>> wrong?
>>
>> Thanks,
>> Thiago
>>
>> #include <mpi.h>
>> #include <iostream>
>> #include <stdio.h>
>> #include <cmath>
>> #include <unistd.h>  // for sleep()
>> using namespace std;
>>
>> int main(int argc, char* argv[]) {
>>
>>  int provided=-1;
>>  int requested = MPI_THREAD_SINGLE;//MPI_THREAD_MULTIPLE;
>>  MPI_Init_thread(&argc, &argv, requested, &provided);
>>
>>  const int rank = MPI::COMM_WORLD.Get_rank();
>>
>>  const unsigned long NUM_GETS = 10000; // something big
>>  const unsigned long BUFFER_SIZE = 2048*32;
>>  char* sendBuffer = new char[BUFFER_SIZE];
>>
>>  MPI::Win win = MPI::Win::Create(sendBuffer, BUFFER_SIZE, 1,
>> MPI_INFO_NULL, MPI_COMM_WORLD);
>>
>>
>>  double startTime = MPI::Wtime();  // never reset, so the reported times accumulate across iterations
>>
>>  char* receiveBuffer = new char[BUFFER_SIZE];
>>
>>  for (int j=0; j < 5; ++j) {
>>    const int repeat = (int)pow(10.0,j);
>>    if (rank == 0)
>>      printf("performing %d non-mpi work for rank 0\n", repeat);
>>    for (unsigned long i = 0; i < NUM_GETS; ++i) {
>>
>>      if (rank == 0) {
>>        for (int i=0; i < repeat; ++i)
>>          if (sqrt(float(i))== -3)
>>            printf("did some computation that will never be used.\n");
>>      }
>>
>>      int owner = 0;
>>      int err = MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
>>      if (err != MPI_SUCCESS) cout << "ERROR lock: " <<err<<endl;
>>
>>      err = MPI_Get(receiveBuffer, BUFFER_SIZE, MPI_CHAR,
>>                    owner, 0, BUFFER_SIZE, MPI_CHAR, win);
>>      if (err != MPI_SUCCESS) cout << "ERROR get: " <<err<<endl;
>>
>>      err = MPI_Win_unlock(owner, win);
>>      if (err != MPI_SUCCESS) cout << "ERROR unlock: " <<err<<endl;
>>    }
>>
>>    double endTime = MPI::Wtime();
>>    double sec = (endTime-startTime);
>>    unsigned long bitsGot = 1*8*BUFFER_SIZE*NUM_GETS;
>>    float GbitsGot = bitsGot/1073741824;
>>    printf("rank %d did %.3lf Gb/s in %.3lf seconds \n",rank,
>> (GbitsGot/sec), sec);
>>
>>  MPI_Barrier(MPI_COMM_WORLD);
>>  sleep(1); // so that the output is in sync and looks pretty
>>  }
>>  win.Free();  // MPI_Win_free() expects an MPI_Win*, so free via the C++ binding
>>  delete[] receiveBuffer;
>>  delete[] sendBuffer;
>>
>>  MPI::Finalize();
>>  return 0;
>> }
>>