[mvapich-discuss] collectives fail under mvapich2-1.0 (fwd)
Dhabaleswar Panda
panda at cse.ohio-state.edu
Tue Oct 2 14:20:10 EDT 2007
Hi Kevin,
Thanks for providing further insights into this problem based on
QLogic's experience.
It will be good if the SkaMPI team can take a look at this issue and make
appropriate changes to their benchmarks.
Thanks,
DK
======
> Hi Wei and Ed,
>
> This email triggered a memory in me, and after digging through our bug
> tracking system, I found that we ran into this exact same problem with
> PathScale's port of MPICH quite a while ago. Here is the analysis from
> one of our engineers at that time:
>
> -----------------------------------------------------------
> The root of the problem is in the function syncol_pattern in
> skosfile.c. It tries to be clever about obtaining a uniform value across all of
> its timings (100 samples at most, to be precise), and to that end, for some
> unknown reason, once it gets a flawed measurement it waits an inordinately long
> time (1.7 seconds, to be precise) between two samples. I looked around for the
> reason behind this number but could not find one. And when the next sample is
> flawed as well, it increments the time it should wait. Here is the routine it
> calls:
>
> int wait_till(double time_stamp, double *last_time_stamp)
> {
>   if( (*last_time_stamp = MPI_Wtime()) > time_stamp )
>     return 0;
>   else {
>     while( (*last_time_stamp = MPI_Wtime()) < time_stamp ) ;
>     return 1;
>   }
> }
>
>
> And the interval to wait for is determined by:
>
> double should_wait_till(int counter, double interval, double offset)
> {
>   return (counter+1)*interval + first_time + offset;
> }
>
>
> As you can see, depending on which sample had the flawed measurement, the
> wait period grows by that much!
>
> And that's why the program crawls and crawls and feels like it is hanging.
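>
> To make the effect concrete, here is a minimal standalone sketch (not the
> SkaMPI source). It reuses the should_wait_till() formula shown above and
> simply assumes, for illustration, that the interval is bumped by the same
> 1.7 s step on every flawed sample; the exact adjustment in skosfile.c may
> differ:
>
> #include <stdio.h>
>
> static double first_time = 0.0;   /* base timestamp; 0 for illustration */
>
> double should_wait_till(int counter, double interval, double offset)
> {
>   return (counter + 1) * interval + first_time + offset;
> }
>
> int main(void)
> {
>   double interval = 1.7;          /* the suspicious 1.7 second wait */
>
>   for (int counter = 0; counter < 5; counter++) {
>     /* Pretend every sample is flagged as flawed, so the interval grows. */
>     interval += 1.7;
>     printf("sample %d: spin in wait_till() until t = %.1f s\n",
>            counter, should_wait_till(counter, interval, 0.0));
>   }
>   return 0;
> }
>
> With every flawed sample both the counter and the interval grow, so the
> target timestamp that wait_till() busy-waits for moves out faster and
> faster, which matches the crawling behaviour described above.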
> ----------------------------------------------------------------------
>
> Another engineer added the following:
>
> -----------------------------------------------------------------------
> Verified, thanks for the analysis and hard work. I made the following edits to
> encourage speedier forward progress but it is still dog slow:
>
> ===================================================================
> RCS file: RCS/skosfile.c,v
> retrieving revision 1.1
> diff -r1.1 skosfile.c
> 16214c16214,16219
> < if( max_tbms[i] >= interval ) flawed_measurements++;
> ---
> > if( max_tbms[i] > interval &&
> > (max_tbms[i] - interval) / interval > 0.01 ) {
> > flawed_measurements++;
> > /* printf("FLAWED: %f >= %f, flawed_measurements = %d\n",
> > max_tbms[i], interval, flawed_measurements); */
> > }
> 16216a16222
> > /* printf("REALLY FLAWED : interval now %f\n", interval); */
>
> ----------------------------------------------------------
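>
> In other words, the patch replaces the strict ">= interval" test with a
> relative tolerance: a sample only counts as flawed if it overshoots the
> interval by more than 1%. A minimal sketch of that check as a standalone
> helper (is_flawed() is a hypothetical name, not a SkaMPI function):
>
> #include <stdbool.h>
>
> /* Flag a measurement only if it exceeds the expected interval by > 1%. */
> static bool is_flawed(double max_tbm, double interval)
> {
>   return max_tbm > interval &&
>          (max_tbm - interval) / interval > 0.01;
> }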
>
> On Tue, 2007-10-02 at 08:34, wei huang wrote:
> > Hi Ed,
> >
> > We looked into the problem further with respect to the one-sided issues.
> > However, we don't see the program hanging in the MPI library; in fact the
> > program is not hanging at all. For MPI_Win_test, we find the following code:
> >
> > if (get_measurement_rank() == 0) {
> >   reduced_group = exclude_rank_from_group(0, onesided_group);
> >   mpiassert = extract_onesided_assertions(assertion, "MPI_Win_post");
> >   MPI_Win_post(reduced_group, mpiassert, onesided_win);
> >
> >   start_time = start_synchronization();
> >   MPI_Win_test(onesided_win, &flag);
> >   end_time = stop_synchronization();
> >   if (flag == 0)
> >     MPI_Win_wait(onesided_win);
> > }
> > else {
> >   reduced_group = exclude_all_ranks_except_from_group(0, onesided_group);
> >   mpiassert = extract_onesided_assertions(assertion, "MPI_Win_start");
> >   MPI_Win_start(reduced_group, mpiassert, onesided_win);
> >   if (do_a_put)
> >     MPI_Put(get_send_buffer(), count, datatype, 0, get_measurement_rank(),
> >             count, datatype, onesided_win);
> >   MPI_Win_complete(onesided_win);
> >   start_synchronization();
> >   stop_synchronization();
> > }
> >
> > And the test is spending more and more time in start_synchronization(),
> > which seems to calculate a certain target timestamp and then busily reads
> > MPI_Wtime() until that timestamp is reached. We find that
> > start_synchronization() takes longer and longer, eventually spending tens
> > of seconds before it returns. We are not sure how the timestamp is
> > calculated, so we cc this email to the SkaMPI team and hope they can give
> > some insights here.
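> >
> > For reference, here is a minimal standalone sketch of the busy-wait pattern
> > we are describing (our paraphrase for illustration only, not the SkaMPI
> > source; the target-time computation here is made up):
> >
> > #include <mpi.h>
> >
> > /* Spin on MPI_Wtime() until the given target timestamp has passed. */
> > static void spin_until(double target)
> > {
> >   while (MPI_Wtime() < target)
> >     ;  /* busy wait: burns CPU and looks like a hang if target is far off */
> > }
> >
> > int main(int argc, char **argv)
> > {
> >   MPI_Init(&argc, &argv);
> >
> >   /* Hypothetical target: "now" plus a fixed interval agreed by all ranks.
> >    * If that interval keeps growing between measurements, every rank sits
> >    * in spin_until() for longer and longer, exactly as we observe. */
> >   double target = MPI_Wtime() + 1.7;
> >   spin_until(target);
> >
> >   MPI_Barrier(MPI_COMM_WORLD);
> >   MPI_Finalize();
> >   return 0;
> > }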
> >
> > Dear SkaMPI team, we face a problem running SkaMPI with mvapich2-1.0 on
> > 12 processes (3 nodes, 4 processes each, block distribution). We find that
> > start_synchronization() in the MPI_Win_test measurement takes longer and
> > longer to return as the test goes on. As a result, the test appears to
> > hang. We are not sure how the timestamp is calculated and how you adjust
> > this value. Could you please give us some insights here?
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Thu, 27 Sep 2007, Edmund Sumbar wrote:
> >
> > > Edmund Sumbar wrote:
> > > > I'll try running the SKaMPI tests again. Maybe
> > > > I missed something, as with the mvapich2 tests.
> > >
> > > I recompiled and reran SKaMPI pt2pt, coll,
> > > onesided, and mmisc tests on 3 nodes, 4
> > > processors per node.
> > >
> > > pt2pt and mmisc succeeded, while coll and
> > > onesided failed (stalled). Any ideas?
> > >
> > > For what it's worth, here are the tails of
> > > the output files...
> > >
> > >
> > > $ tail coll_ib-3x4.sko
> > > # SKaMPI Version 5.0 rev. 191
> > >
> > > begin result "MPI_Bcast-nodes-short"
> > > nodes= 2 1024 3.8 0.2 39 2.9 3.6
> > > nodes= 3 1024 6.6 0.4 38 4.0 6.3 4.9
> > > nodes= 4 1024 9.2 0.2 32 4.6 7.7 7.6 8.6
> > >
> > >
> > > $ tail onesided_ib-3x4.sko
> > > cpus= 8 4 50051.7 1.3 8 50051.7 --- --- --- --- --- --- ---
> > > cpus= 9 4 50051.5 0.7 8 50051.5 --- --- --- --- --- --- --- ---
> > > cpus= 10 4 50047.7 1.6 8 50047.7 --- --- --- --- --- --- ---
> > > --- ---
> > > cpus= 11 4 50058.2 2.7 8 50058.2 --- --- --- --- --- --- ---
> > > --- --- ---
> > > cpus= 12 4 50074.3 2.8 8 50074.3 --- --- --- --- --- --- ---
> > > --- --- --- ---
> > > end result "MPI_Win_wait delayed,small"
> > > # duration = 9.00 sec
> > >
> > > begin result "MPI_Win_wait delayed without MPI_Put"
> > > cpus= 2 1048576 50025.0 1.4 8 50025.0 ---
> > >
> > >
> > > --
> > > Ed[mund [Sumbar]]
> > > AICT Research Support, Univ of Alberta
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>