[mvapich-discuss] collectives fail under mvapich2-1.0 (fwd)
Dhabaleswar Panda
panda at cse.ohio-state.edu
Tue Oct 2 14:20:10 EDT 2007
Hi Kevin,
Thanks for providing further insights into this problem based on
QLogic's experience.
It will be good if the SkaMPI team can take a look at this issue and make
appropriate changes to their benchmarks.
Thanks,
DK
======
> Hi Wei and Ed,
>
> This email triggered a memory in me, and after digging through our bug
> tracking system, I found that we ran into this exact same problem with
> PathScale's port of MPICH quite a while ago. Here is the analysis from
> one of our engineers at that time:
>
> -----------------------------------------------------------
> The root of the problem is in the function syncol_pattern in
> skosfile.c. It tries to be clever about obtaining a uniform value across all of
> its timings (100 samples at most, to be precise), and to that end, for some
> unknown reason, once it gets a flawed measurement it waits an inordinately long
> time (1.7 seconds, to be precise) between two samples. I looked around for the
> reason behind this number but could not find one. And when the next sample is
> flawed as well, it increments the time it should wait. Here is the routine it
> calls:
>
> int wait_till(double time_stamp, double *last_time_stamp)
> {
>   if( (*last_time_stamp = MPI_Wtime()) > time_stamp )
>     return 0;
>   else {
>     while( (*last_time_stamp = MPI_Wtime()) < time_stamp ) ;
>     return 1;
>   }
> }
>
>
> And the interval to wait for is determined by:
>
> double should_wait_till(int counter, double interval, double offset)
> {
>   return (counter+1)*interval + first_time + offset;
> }
>
>
> As you can see, depending on which sample had the flawed measurement, the
> wait period grows by that much!
>
> And that's why the program crawls and crawls and feels like it is hanging.
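>
> To make the effect concrete, here is a minimal standalone sketch (not the
> SkaMPI source). It reuses the should_wait_till() formula shown above and
> simply assumes, for illustration, that the interval is bumped by the same
> 1.7 s step on every flawed sample; the exact adjustment in skosfile.c may
> differ:
>
> #include <stdio.h>
>
> static double first_time = 0.0;   /* base timestamp; 0 for illustration */
>
> double should_wait_till(int counter, double interval, double offset)
> {
>   return (counter + 1) * interval + first_time + offset;
> }
>
> int main(void)
> {
>   double interval = 1.7;          /* the suspicious 1.7 second wait */
>
>   for (int counter = 0; counter < 5; counter++) {
>     /* Pretend every sample is flagged as flawed, so the interval grows. */
>     interval += 1.7;
>     printf("sample %d: spin in wait_till() until t = %.1f s\n",
>            counter, should_wait_till(counter, interval, 0.0));
>   }
>   return 0;
> }
>
> With every flawed sample both the counter and the interval grow, so the
> target timestamp that wait_till() busy-waits for moves out faster and
> faster, which matches the crawling behaviour described above.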
> ----------------------------------------------------------------------
>
> Another engineer added the following:
>
> -----------------------------------------------------------------------
> Verified, thanks for the analysis and hard work. I made the following edits to
> encourage speedier forward progress but it is still dog slow:
>
> ===================================================================
> RCS file: RCS/skosfile.c,v
> retrieving revision 1.1
> diff -r1.1 skosfile.c
> 16214c16214,16219
> < if( max_tbms[i] >= interval ) flawed_measurements++;
> ---
> > if( max_tbms[i] > interval &&
> > (max_tbms[i] - interval) / interval > 0.01 ) {
> > flawed_measurements++;
> > /* printf("FLAWED: %f >= %f, flawed_measurements = %d\n",
> > max_tbms[i], interval, flawed_measurements); */
> > }
> 16216a16222
> > /* printf("REALLY FLAWED : interval now %f\n", interval); */
>
> ----------------------------------------------------------
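>
> In other words, the patch replaces the strict ">= interval" test with a
> relative tolerance: a sample only counts as flawed if it overshoots the
> interval by more than 1%. A minimal sketch of that check as a standalone
> helper (is_flawed() is a hypothetical name, not a SkaMPI function):
>
> #include <stdbool.h>
>
> /* Flag a measurement only if it exceeds the expected interval by > 1%. */
> static bool is_flawed(double max_tbm, double interval)
> {
>   return max_tbm > interval &&
>          (max_tbm - interval) / interval > 0.01;
> }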
>
> On Tue, 2007-10-02 at 08:34, wei huang wrote:
> > Hi Ed,
> >
> > We looked into the problem further with respect to the one-sided issues.
> > However, we don't see the program hanging in the MPI library; in fact the
> > program is not hanging at all. For MPI_Win_test, we find the following code:
> >
> > if (get_measurement_rank() == 0) {
> >   reduced_group = exclude_rank_from_group(0, onesided_group);
> >   mpiassert = extract_onesided_assertions(assertion, "MPI_Win_post");
> >   MPI_Win_post(reduced_group, mpiassert, onesided_win);
> >
> >   start_time = start_synchronization();
> >   MPI_Win_test(onesided_win, &flag);
> >   end_time = stop_synchronization();
> >   if (flag == 0)
> >     MPI_Win_wait(onesided_win);
> > }
> > else {
> >   reduced_group = exclude_all_ranks_except_from_group(0, onesided_group);
> >   mpiassert = extract_onesided_assertions(assertion, "MPI_Win_start");
> >   MPI_Win_start(reduced_group, mpiassert, onesided_win);
> >   if (do_a_put)
> >     MPI_Put(get_send_buffer(), count, datatype, 0, get_measurement_rank(),
> >             count, datatype, onesided_win);
> >   MPI_Win_complete(onesided_win);
> >   start_synchronization();
> >   stop_synchronization();
> > }
> >
> > And the test is spending more and more time in start_synchronization(),
> > which seems to calculate a certain target timestamp and then busily reads
> > MPI_Wtime() until that timestamp is reached. We find that
> > start_synchronization() takes longer and longer, eventually spending tens
> > of seconds before it returns. We are not sure how the timestamp is
> > calculated, so we cc this email to the SkaMPI team and hope they can give
> > some insights here.
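> >
> > For reference, here is a minimal standalone sketch of the busy-wait pattern
> > we are describing (our paraphrase for illustration only, not the SkaMPI
> > source; the target-time computation here is made up):
> >
> > #include <mpi.h>
> >
> > /* Spin on MPI_Wtime() until the given target timestamp has passed. */
> > static void spin_until(double target)
> > {
> >   while (MPI_Wtime() < target)
> >     ;  /* busy wait: burns CPU and looks like a hang if target is far off */
> > }
> >
> > int main(int argc, char **argv)
> > {
> >   MPI_Init(&argc, &argv);
> >
> >   /* Hypothetical target: "now" plus a fixed interval agreed by all ranks.
> >    * If that interval keeps growing between measurements, every rank sits
> >    * in spin_until() for longer and longer, exactly as we observe. */
> >   double target = MPI_Wtime() + 1.7;
> >   spin_until(target);
> >
> >   MPI_Barrier(MPI_COMM_WORLD);
> >   MPI_Finalize();
> >   return 0;
> > }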
> >
> > Dear SkaMPI team, we face a problem running SkaMPI with mvapich2-1.0 on
> > 12 processes (3 nodes, 4 processes each, block distribution). We find that
> > start_synchronization() in the MPI_Win_test measurement takes longer and
> > longer to return as the test goes on. As a result, the test appears to
> > hang. We are not sure how the timestamp is calculated and how you adjust
> > this value. Could you please give us some insights here?
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Thu, 27 Sep 2007, Edmund Sumbar wrote:
> >
> > > Edmund Sumbar wrote:
> > > > I'll try running the SKaMPI tests again. Maybe
> > > > I missed something, as with the mvapich2 tests.
> > >
> > > I recompiled and reran SKaMPI pt2pt, coll,
> > > onesided, and mmisc tests on 3 nodes, 4
> > > processors per node.
> > >
> > > pt2pt and mmisc succeeded, while coll and
> > > onesided failed (stalled). Any ideas?
> > >
> > > For what it's worth, here are the tails of
> > > the output files...
> > >
> > >
> > > $ tail coll_ib-3x4.sko
> > > # SKaMPI Version 5.0 rev. 191
> > >
> > > begin result "MPI_Bcast-nodes-short"
> > > nodes= 2 1024 3.8 0.2 39 2.9 3.6
> > > nodes= 3 1024 6.6 0.4 38 4.0 6.3 4.9
> > > nodes= 4 1024 9.2 0.2 32 4.6 7.7 7.6 8.6
> > >
> > >
> > > $ tail onesided_ib-3x4.sko
> > > cpus= 8 4 50051.7 1.3 8 50051.7 --- --- --- --- --- --- ---
> > > cpus= 9 4 50051.5 0.7 8 50051.5 --- --- --- --- --- --- --- ---
> > > cpus= 10 4 50047.7 1.6 8 50047.7 --- --- --- --- --- --- ---
> > > --- ---
> > > cpus= 11 4 50058.2 2.7 8 50058.2 --- --- --- --- --- --- ---
> > > --- --- ---
> > > cpus= 12 4 50074.3 2.8 8 50074.3 --- --- --- --- --- --- ---
> > > --- --- --- ---
> > > end result "MPI_Win_wait delayed,small"
> > > # duration = 9.00 sec
> > >
> > > begin result "MPI_Win_wait delayed without MPI_Put"
> > > cpus= 2 1048576 50025.0 1.4 8 50025.0 ---
> > >
> > >
> > > --
> > > Ed[mund [Sumbar]]
> > > AICT Research Support, Univ of Alberta
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>