[mvapich-discuss] collectives fail under mvapich2-1.0 (fwd)

Kevin Ball kevin.ball at qlogic.com
Tue Oct 2 13:01:20 EDT 2007


Hi Wei and Ed,

  This email triggered a memory in me, and after digging through our bug
tracking system, I found that we ran into this exact same problem with
PathScale's port of MPICH quite a while ago.  Here is the analysis from
one of our engineers at that time:

-----------------------------------------------------------
The root of the problem is in the function syncol_pattern in
skosfile.c. It tries to be smart about getting a uniform value across all of its
timings (100 at most, to be precise), and to that end, for some unknown reason,
it decides that once it gets a flawed measurement, it should wait an
inordinately long time (1.7 seconds, to be precise) between taking two samples.
I tried to find the reason for this number, but could not. And when the next
sample is flawed as well, it increments the time it should wait. Here is the
routine it calls:

/* Spins until `time_stamp` is reached: returns 0 immediately if the
 * target has already passed, otherwise busy-reads MPI_Wtime() in a
 * tight loop (no sleep) and returns 1.  Any inflated target timestamp
 * therefore turns directly into burned CPU time. */
int wait_till(double time_stamp, double *last_time_stamp)
{
  if( (*last_time_stamp = MPI_Wtime()) > time_stamp )
    return 0;
  else {
    while( (*last_time_stamp = MPI_Wtime()) < time_stamp ) ;
    return 1;
  }
}


And the interval to wait for is determined by:

/* Computes the target timestamp for the next sample: `first_time` is a
 * global reference point, and each flawed sample bumps `counter`, which
 * pushes the target out by one more full `interval`. */
double should_wait_till(int counter, double interval, double offset)
{
  return (counter+1)*interval + first_time + offset;
}


As you can see, the wait period scales with the index of the sample whose
measurement was flawed: the later the flaw, the longer the wait!

And that's why the program crawls and crawls and feels like it is hanging.
----------------------------------------------------------------------
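
To make the growth concrete, here is a minimal standalone sketch of the
target timestamps that formula produces. This is an illustration, not actual
SkaMPI source: next_target stands in for should_wait_till with first_time
passed explicitly instead of read from a global, and we assume every sample
gets flagged as flawed so the counter keeps climbing.

#include <stdio.h>

/* Illustrative only: mirrors should_wait_till() above, with first_time
 * passed in rather than taken from a global. */
static double next_target(int counter, double interval,
                          double first_time, double offset)
{
    return (counter + 1) * interval + first_time + offset;
}

int main(void)
{
    const double interval = 1.7;  /* the 1.7 s penalty quoted above */
    int counter;

    /* Assume every sample is flagged as flawed, so counter keeps rising. */
    for (counter = 0; counter < 10; counter++)
        printf("sample %2d: busy-wait until t = %5.1f s\n",
               counter, next_target(counter, interval, 0.0, 0.0));
    return 0;
}

By the tenth flawed sample the target is already 17 seconds out, all of it
spent spinning in wait_till(), which matches the tens-of-seconds stalls
described further down in this thread.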

Another engineer added the following:

-----------------------------------------------------------------------
Verified, thanks for the analysis and hard work. I made the following edits to
encourage speedier forward progress, but it is still dog slow:

===================================================================
RCS file: RCS/skosfile.c,v
retrieving revision 1.1
diff -r1.1 skosfile.c
16214c16214,16219
<           if( max_tbms[i] >= interval ) flawed_measurements++;
---
>           if( max_tbms[i] > interval &&
>                 (max_tbms[i] - interval) / interval > 0.01 ) {
>               flawed_measurements++;
>               /* printf("FLAWED: %f >= %f, flawed_measurements = %d\n",
>                max_tbms[i], interval, flawed_measurements); */
>           }
16216a16222
>             /* printf("REALLY FLAWED : interval now %f\n", interval); */

----------------------------------------------------------
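
The key change in that diff is replacing the absolute comparison with a
relative one. Pulled out as a standalone predicate, it looks roughly like
this (the 1% threshold is the value from the patch; the function name is
ours):

/* A measurement is counted as flawed only if it exceeds the expected
 * interval by more than 1% relative error, rather than by any amount. */
static int is_flawed(double measured, double interval)
{
    return measured > interval &&
           (measured - interval) / interval > 0.01;
}

With this test, a measurement only a few microseconds over the expected
interval no longer triggers the growing penalty wait.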

On Tue, 2007-10-02 at 08:34, wei huang wrote:
> Hi Ed,
> 
> We looked into the problem more wrt the one-sided issues. However, we don't
> see the program hang in the MPI library; actually, the program is not
> hanging. But somehow for MPI_Win_test, we find the following code:
> 
>   if (get_measurement_rank() == 0) {
>     reduced_group = exclude_rank_from_group(0, onesided_group);
>     mpiassert = extract_onesided_assertions(assertion, "MPI_Win_post");
>     MPI_Win_post(reduced_group, mpiassert, onesided_win);
> 
>     start_time = start_synchronization();
>     MPI_Win_test(onesided_win, &flag);
>     end_time = stop_synchronization();
>     if (flag == 0)
>       MPI_Win_wait(onesided_win);
>   }
>   else {
>     reduced_group = exclude_all_ranks_except_from_group(0, onesided_group);
>     mpiassert = extract_onesided_assertions(assertion, "MPI_Win_start");
>     MPI_Win_start(reduced_group, mpiassert, onesided_win);
>     if (do_a_put)
>       MPI_Put(get_send_buffer(), count, datatype, 0, get_measurement_rank(),
>               count, datatype, onesided_win);
>     MPI_Win_complete(onesided_win);
>     start_synchronization();
>     stop_synchronization();
>   }
> 
> And the test is spending more and more time in start_synchronization(),
> which seems to calculate a certain timestamp and then busily read
> MPI_Wtime() until that timestamp is reached. We find that
> start_synchronization() takes longer and longer, and finally spends tens of
> seconds before it returns. We are not sure how the timestamp is calculated,
> so we cc this email to the SkaMPI team and hope they can give some insights
> here.
> 
> Dear SkaMPI team, we face a problem running SkaMPI with mvapich2-1.0 on
> 12 processes (3 nodes, 4 processes each, block distribution). We find that
> start_synchronization() in MPI_Win_test takes a very long time to return
> as the test goes on. As a result, the test appears to hang. We are not
> sure how the timestamp is calculated or how you adjust this value. Could
> you please give some insights here?
> 
> Thanks.
> 
> Regards,
> Wei Huang
> 
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
> 
> 
> On Thu, 27 Sep 2007, Edmund Sumbar wrote:
> 
> > Edmund Sumbar wrote:
> > > I'll try running the SKaMPI tests again.  Maybe
> > > I missed something, as with the mvapich2 tests.
> >
> > I recompiled and reran SKaMPI pt2pt, coll,
> > onesided, and mmisc tests on 3 nodes, 4
> > processors per node.
> >
> > pt2pt and mmisc succeeded, while coll and
> > onesided failed (stalled).  Any ideas?
> >
> > For what it's worth, here are the tails of
> > the output files...
> >
> >
> > $ tail coll_ib-3x4.sko
> > # SKaMPI Version 5.0 rev. 191
> >
> > begin result "MPI_Bcast-nodes-short"
> > nodes= 2     1024       3.8       0.2       39       2.9       3.6
> > nodes= 3     1024       6.6       0.4       38       4.0       6.3       4.9
> > nodes= 4     1024       9.2       0.2       32       4.6       7.7       7.6       8.6
> >
> >
> > $ tail onesided_ib-3x4.sko
> > cpus= 8        4   50051.7       1.3        8   50051.7    ---       ---       ---       ---       ---       ---       ---
> > cpus= 9        4   50051.5       0.7        8   50051.5    ---       ---       ---       ---       ---       ---       ---       ---
> > cpus= 10        4   50047.7       1.6        8   50047.7    ---       ---       ---       ---       ---       ---       ---       ---       ---
> > cpus= 11        4   50058.2       2.7        8   50058.2    ---       ---       ---       ---       ---       ---       ---       ---       ---       ---
> > cpus= 12        4   50074.3       2.8        8   50074.3    ---       ---       ---       ---       ---       ---       ---       ---       ---       ---       ---
> > end result "MPI_Win_wait delayed,small"
> > # duration = 9.00 sec
> >
> > begin result "MPI_Win_wait delayed without MPI_Put"
> > cpus= 2  1048576   50025.0       1.4        8   50025.0    ---
> >
> >
> > --
> > Ed[mund [Sumbar]]
> > AICT Research Support, Univ of Alberta
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


