[mvapich-discuss] collectives fail under mvapich2-1.0 (fwd)

Ralf Reussner reussner at ipd.uka.de
Thu Oct 4 05:20:30 EDT 2007


Dear Wei Huang,

thanks for your report. We will have a look into this. For the ongoing 
discussion, please use the address of our developer & maintainer team, 
skampi at ira.uka.de, instead of my personal email address, so that all 
of us stay informed.

Best regards
Ralf
> Hi Ed,
>
> We looked into the problem further with respect to the one-sided issues.
> We don't see the program hang inside the MPI library; in fact, the program
> is not hanging at all. For the MPI_Win_test measurement we find the
> following code:
>
>   if (get_measurement_rank() == 0) {
>     reduced_group = exclude_rank_from_group(0, onesided_group);
>     mpiassert = extract_onesided_assertions(assertion, "MPI_Win_post");
>     MPI_Win_post(reduced_group, mpiassert, onesided_win);
>
>     start_time = start_synchronization();
>     MPI_Win_test(onesided_win, &flag);
>     end_time = stop_synchronization();
>     if (flag == 0)
>       MPI_Win_wait(onesided_win);
>   }
>   else {
>     reduced_group = exclude_all_ranks_except_from_group(0, onesided_group);
>     mpiassert = extract_onesided_assertions(assertion, "MPI_Win_start");
>     MPI_Win_start(reduced_group, mpiassert, onesided_win);
>     if (do_a_put)
>       MPI_Put(get_send_buffer(), count, datatype, 0, get_measurement_rank(),
>               count, datatype, onesided_win);
>     MPI_Win_complete(onesided_win);
>     start_synchronization();
>     stop_synchronization();
>   }
>
> The test is spending more and more time in start_synchronization(), which
> seems to calculate a certain target timestamp and then busily reads wtime()
> until that timestamp is reached. We find that start_synchronization() takes
> longer and longer to return, eventually spending tens of seconds per call.
> We are not sure how the timestamp is calculated, so we cc this email to the
> SkaMPI team and hope they can give some insight here.
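>
> For reference, here is a rough sketch of what we imagine
> start_synchronization() and stop_synchronization() do, judging from the
> observed behaviour (this is only our own illustration in plain MPI C, not
> the actual SkaMPI code; the names next_slot and slot_length are made up):
>
>   #include <mpi.h>
>
>   static double next_slot;           /* globally agreed slot start time */
>   static double slot_length = 0.01;  /* assumed length of one time slot */
>
>   /* Busy-wait on MPI_Wtime() until the agreed slot begins, so that all
>      processes start the measurement at the same (synchronized) time.  */
>   double start_synchronization(void)
>   {
>     double now;
>     do {
>       now = MPI_Wtime();
>     } while (now < next_slot);
>     return now;
>   }
>
>   /* Record the end time and move the next slot forward; presumably the
>      real code negotiates the new slot collectively among all ranks.   */
>   double stop_synchronization(void)
>   {
>     double now = MPI_Wtime();
>     next_slot = now + slot_length;
>     return now;
>   }
>
> If the agreed slot drifts further and further into the future on some
> rank, a busy wait like this would explain the growing delays we observe.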
>
> Dear SkaMPI team, we face a problem running SkaMPI with mvapich2-1.0 on
> 12 processes (3 nodes, 4 processes each, block distribution). We find that
> start_synchronization() in the MPI_Win_test measurement takes a very long
> time to return as the test goes on, so the test appears to hang. We are
> not sure how the timestamp is calculated or how you adjust this value.
> Could you please give us some insight here?
>
> Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Thu, 27 Sep 2007, Edmund Sumbar wrote:
>
>   
>> Edmund Sumbar wrote:
>>     
>>> I'll try running the SKaMPI tests again.  Maybe
>>> I missed something, as with the mvapich2 tests.
>>>       
>> I recompiled and reran SKaMPI pt2pt, coll,
>> onesided, and mmisc tests on 3 nodes, 4
>> processors per node.
>>
>> pt2pt and mmisc succeeded, while coll and
>> onesided failed (stalled).  Any ideas?
>>
>> For what it's worth, here are the tails of
>> the output files...
>>
>>
>> $ tail coll_ib-3x4.sko
>> # SKaMPI Version 5.0 rev. 191
>>
>> begin result "MPI_Bcast-nodes-short"
>> nodes= 2     1024       3.8       0.2       39       2.9       3.6
>> nodes= 3     1024       6.6       0.4       38       4.0       6.3       4.9
>> nodes= 4     1024       9.2       0.2       32       4.6       7.7       7.6       8.6
>>
>>
>> $ tail onesided_ib-3x4.sko
>> cpus= 8        4   50051.7       1.3        8   50051.7    ---       ---       ---       ---       ---       ---       ---
>> cpus= 9        4   50051.5       0.7        8   50051.5    ---       ---       ---       ---       ---       ---       ---       ---
>> cpus= 10        4   50047.7       1.6        8   50047.7    ---       ---       ---       ---       ---       ---       ---       ---       ---
>> cpus= 11        4   50058.2       2.7        8   50058.2    ---       ---       ---       ---       ---       ---       ---       ---       ---       ---
>> cpus= 12        4   50074.3       2.8        8   50074.3    ---       ---       ---       ---       ---       ---       ---       ---       ---       ---       ---
>> end result "MPI_Win_wait delayed,small"
>> # duration = 9.00 sec
>>
>> begin result "MPI_Win_wait delayed without MPI_Put"
>> cpus= 2  1048576   50025.0       1.4        8   50025.0    ---
>>
>>
>> --
>> Ed[mund [Sumbar]]
>> AICT Research Support, Univ of Alberta
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>     
>
>   


-- 
--------------------------------------------------------------
Prof. Dr. Ralf Reussner  -  Chair Software-Design and -Quality
Institute for Program Structures and Data Organization
Faculty of Informatics, Universitaet Karlsruhe (TH)
Am Fasanengarten 5, D-76131 Karlsruhe, Germany
Office 327, Main Computer Science Building (50.34)
Tel. +49 721 608 5993, Fax. +49 721 608 5990
http://sdq.ipd.uka.de
--------------------------------------------------------------



