[Skampi] Re: [mvapich-discuss] collectives fail under mvapich2-1.0 (fwd)

Wed Oct 31 14:11:34 EDT 2007

On Thu, 04 Oct 2007 11:20:30 +0200
Ralf Reussner <reussner at ipd.uka.de> wrote:

> Dear Wei Huang,
> 
> thanks for your report. We will have a look into this. Please include 
> instead of my email address the email of our developer & maintainer
> team skampi at ira.uka.de in the discussions, so that all of us are
> informed on the discussion.

Hi,

sorry that I kept you waiting for so long, but I needed a quiet hour to
have a look at the algorithm myself. Meanwhile some guys from OpenMPI
had some similar problems so I've put something together:

The basic idea behind the algorithm is that the measurement
environment starts the communication operation synchronously on all
processors (taking into consideration the time differences between the
local clocks) and provides a time slot in which there is no other
interfering communication.

So we have something like:

   while number measurements resp. precision not reached

      repeat a couple of times

          start_synchronization()
             wait for start of time slot in the future (if time slot
             doesn't start in the future continue but remember the
             measurement is INVALID_STARTED_LATE)

          MPI_Whatever(....) 

          stop_synchronization()
             check if time is still in time slot otherwise mark
             measurements as INVALID_TOOK_TOO_LONG

      gather results and valid flags

      pick out the valid measurements, ignore the other ones

      decide if the time slot is large enough, increase time slot if
      necessary
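
The inner loop above can be sketched in C for a single process. This is
only an illustration under assumptions: clock_gettime(CLOCK_MONOTONIC)
stands in for MPI_Wtime, the clocks are taken as already synchronised,
and the function signatures as well as the names measure_in_slots,
quick_op and slow_op are made up here and are not SKaMPI's. Only the
flag names come from the pseudocode.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Validity flags, named as in the pseudocode. */
enum validity { VALID = 0, INVALID_STARTED_LATE, INVALID_TOOK_TOO_LONG };

/* Assumed stand-in for MPI_Wtime(): monotonic time in seconds. */
double wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Busy-wait for the slot start; report if the slot has already begun. */
enum validity start_synchronization(double slot_start)
{
    if (wtime() >= slot_start)
        return INVALID_STARTED_LATE;   /* continue, but remember it */
    while (wtime() < slot_start)
        ;
    return VALID;
}

/* Check that we are still inside the slot after the operation. */
enum validity stop_synchronization(double slot_end, enum validity so_far)
{
    if (so_far != VALID)
        return so_far;
    return wtime() <= slot_end ? VALID : INVALID_TOOK_TOO_LONG;
}

/* Run op() once in each of n back-to-back slots of `window` seconds.
 * Stores one flag per slot and returns the number of valid runs. */
int measure_in_slots(void (*op)(void), double first_start, double window,
                     int n, enum validity *flags)
{
    int valid = 0;
    assert(n > 0 && window > 0.0);
    for (int i = 0; i < n; i++) {
        double start = first_start + i * window;
        enum validity v = start_synchronization(start);
        op();                          /* MPI_Whatever(....) goes here */
        flags[i] = stop_synchronization(start + window, v);
        if (flags[i] == VALID)
            valid++;
    }
    return valid;
}

/* Hypothetical operations used only to exercise the sketch. */
void quick_op(void) { }

void slow_op(void)
{
    double end = wtime() + 0.005;      /* spin for about 5 ms */
    while (wtime() < end)
        ;
}
```

Calling measure_in_slots(slow_op, wtime() + 0.01, 0.001, 1, flags) should
leave flags[0] invalid, because a 5 ms operation cannot fit into a 1 ms
slot.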

You can find a more detailed explanation in our EuroPVM/MPI
contribution: http://liinwww.ira.uka.de/~augustin/europvm-mpi02.ps

The tricky part is obviously deciding how to choose the time slot for
future measurements so that we don't waste too much time waiting.
If too many communications don't fit into the time slot, it is
obviously too small. But sometimes we get a longer delay (an interrupt
on one processor or some other operating system job running) which
happens only occasionally but interferes with several successive time
slots. Such delays shouldn't increase the time slot, but should perhaps
decrease the number of repetitions before we collect the results and
re-synchronise.
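
One possible shape for this decision, as a sketch only: the thresholds
(more than half of the runs too long, doubling the window, halving the
repetitions) and the names slot_plan and adapt_slot are assumptions made
for illustration, not what SKaMPI actually does.

```c
#include <assert.h>

/* Hypothetical bookkeeping for one round of measurements. */
struct slot_plan {
    double window;       /* length of one time slot in seconds */
    int    repetitions;  /* measurements before gathering results */
};

/* too_long:     runs flagged INVALID_TOOK_TOO_LONG in the last round
 * started_late: runs flagged INVALID_STARTED_LATE in the last round */
struct slot_plan adapt_slot(struct slot_plan plan, int too_long,
                            int started_late)
{
    int invalid = too_long + started_late;
    assert(plan.repetitions > 0 && plan.window > 0.0);
    if (2 * too_long > plan.repetitions) {
        /* Most runs did not fit: the slot itself is too small. */
        plan.window *= 2.0;
    } else if (invalid > 0 && plan.repetitions > 1) {
        /* Occasional disturbance: keep the slot, re-synchronise sooner. */
        plan.repetitions /= 2;
    }
    return plan;
}
```

With 16 repetitions, 12 runs that took too long double the window, while
a single late start only halves the repetitions to 8.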

The second problem is waiting precisely for such short time spans. I've
found a lot of not really portable routines with impressive names like
nano-sleep which were horribly inaccurate or insisted on sleeping for
at least one millisecond. Therefore I had to rely on busy waiting with
MPI_Wtime. So it might be that the MPI implementation doesn't make any
progress while spinning in MPI_Wtime calls?!
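
To check how a sleep routine behaves on a particular machine, one could
compare its overshoot with that of a busy wait. The sketch below assumes
POSIX nanosleep and clock_gettime (the latter standing in for
MPI_Wtime); the helper names sleep_overshoot and spin_overshoot are made
up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Assumed stand-in for MPI_Wtime(): monotonic time in seconds. */
double wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Ask nanosleep() for `request` seconds and return how much longer
 * than requested the call really took (in seconds). */
double sleep_overshoot(double request)
{
    struct timespec req;
    double before;
    assert(request >= 0.0 && request <= 0.5);
    req.tv_sec = 0;
    req.tv_nsec = (long)(request * 1e9);
    before = wtime();
    nanosleep(&req, NULL);
    return (wtime() - before) - request;
}

/* Busy-wait for `request` seconds and return the overshoot past the
 * deadline (in seconds). */
double spin_overshoot(double request)
{
    double deadline = wtime() + request;
    double now;
    while ((now = wtime()) < deadline)
        ;
    return now - deadline;
}
```

Printing sleep_overshoot(50e-6) next to spin_overshoot(50e-6) for a few
hundred calls should show whether sleeping is usable for time slots of
that size.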

I once had the symptoms you described when I didn't realise that, due
to hardware failures, my LAM-MPI was running on only one host. So I had
more processes than processors. Because of the busy waiting, there were
always some processes which started their time slot late, and so the
time slot exploded until it reached the time slice of the operating
system scheduler.

The whole algorithm is already a couple of years old, so if you have
some suggestions for improving it, you're welcome. Especially when I was
thinking about this busy waiting I needed some special psychological
tricks to get to sleep at night ;-)

For further debugging you could try to run a very short input file
like for example:

set_min_repetitions(8)
set_max_repetitions(32)
set_max_relative_standard_error(0.03)

set_skampi_buffer(4MB)

begin measurement "MPI_Bcast-procs-short"
  measure MPI_COMM_WORLD: Bcast(256, MPI_INT, 0)
end measurement

in debug synchronisation mode:

mpirun -np 4 ./skampi -d 128 -i short.ski

Time-critical debug information is not printed on stdout but written
to a separate file proc00??.log for every process (all times are in
microseconds).

If you send me your input file, your measurement implementation (if
it's not a standard one), and the produced output and log files, I'll
have a look at them.

Another way to find out what's going on is to replace the busy waiting:
in measurement.c__wait_till(...) 

exchange 

    while( (*last_time_stamp = wtime()) < time_stamp ) ;

with

    best_sleep_on_your_machine(time_stamp - *last_time_stamp); 

(time_stamp and last_time_stamp are in seconds, adjust appropriately)
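
A middle way between the two variants is to sleep for most of the wait
and spin only for the last stretch. The following is a sketch under
assumptions, not SKaMPI code: hybrid_wait_till and spin_margin are
hypothetical names, nanosleep is just one candidate for the sleep
routine, and clock_gettime stands in for wtime(). Note that it refreshes
*last_time_stamp before computing how long to sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Assumed stand-in for the wtime() used in measurement.c. */
double wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Wait until time_stamp (seconds). Sleep until spin_margin seconds
 * before the deadline, then busy-wait for the remainder. */
void hybrid_wait_till(double time_stamp, double *last_time_stamp,
                      double spin_margin)
{
    double remaining;
    assert(spin_margin >= 0.0);

    *last_time_stamp = wtime();
    remaining = time_stamp - *last_time_stamp - spin_margin;
    if (remaining > 0.0) {
        struct timespec req;
        req.tv_sec = (time_t)remaining;
        req.tv_nsec = (long)((remaining - (double)req.tv_sec) * 1e9);
        if (req.tv_nsec > 999999999L)
            req.tv_nsec = 999999999L;  /* guard against rounding up */
        nanosleep(&req, NULL);         /* may oversleep; checked below */
    }

    /* The original busy wait covers the last stretch, or nothing at
     * all if the sleep already ran past the deadline. */
    while ((*last_time_stamp = wtime()) < time_stamp)
        ;
}
```

The margin would have to be at least as large as the worst oversleep
seen on the machine, otherwise the slot start is missed. With a margin
of zero this degenerates into a plain sleep followed by a check.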

bye,
Werner Augustin

P.S. Please keep skampi at ira.uka.de on the CC because I'm not on the
mvapich-discuss list.