[mvapich-discuss] Hang in MPI_Isend/MPI_Recv combination

Dorian Krause doriankrause at web.de
Fri Aug 14 18:44:49 EDT 2009


Hi Krishna,

> Dorian,
>             Good to know that the temporary workaround has worked for 
> you too. But, this indicates that there is still something wrong with 
> our library. We will try to figure out a more concrete fix in the coming 
> few days.

Thanks for your help with this!

>             Thanks for sending the profiling information. I will take a 
> look at it. Also, I was wondering along these lines: with my 
> understanding of the application so far,  the code snippet that you had 
> sent us separates two communication phases of the application. 

You are referring to the first code snippet I sent? That is the second communication step. You're right: there are two steps.

1. Exchange graph edges
-> Implemented here by an MPI_Alltoall call on an intercommunicator, with Group A sending and Group B only receiving (i.e. Group B's sendcounts are all zero).
2. Exchange data along the edges
-> Done by the MPI_Isend/MPI_Probe combination in this case (so I don't actually use all of the information available; the receiver only needs to know the number of senders to know how many messages it has to probe for).
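
For reference, the second step looks roughly like the following minimal sketch (this is not the actual application code; the buffer layout, tag, element type and communicator are placeholders I made up for illustration):

#include <mpi.h>
#include <vector>

/* Group A: post one nonblocking send per neighbor, then wait for all of
 * them to complete. The barrier is only there for my timing. */
void send_phase(const std::vector<int> &neighbors,
                std::vector<std::vector<double> > &sendbuf, MPI_Comm comm)
{
    std::vector<MPI_Request> req(neighbors.size());
    for (std::size_t i = 0; i < neighbors.size(); ++i)
        MPI_Isend(&sendbuf[i][0], (int)sendbuf[i].size(), MPI_DOUBLE,
                  neighbors[i], 0, comm, &req[i]);
    MPI_Waitall((int)req.size(), &req[0], MPI_STATUSES_IGNORE);
    MPI_Barrier(comm);
}

/* Group B: the receiver only knows how many messages to expect, not from
 * whom or how large they are, so it probes with MPI_ANY_SOURCE first. */
void recv_phase(int nmessages, MPI_Comm comm)
{
    for (int m = 0; m < nmessages; ++m) {
        MPI_Status stat;
        int count;
        MPI_Probe(MPI_ANY_SOURCE, 0, comm, &stat);
        MPI_Get_count(&stat, MPI_DOUBLE, &count);
        std::vector<double> buf(count);
        MPI_Recv(&buf[0], count, MPI_DOUBLE, stat.MPI_SOURCE, stat.MPI_TAG,
                 comm, MPI_STATUS_IGNORE);
        /* ... hand buf over to the application ... */
    }
    MPI_Barrier(comm);
}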

> During 
> the execution of this code, most of the processes are either waiting 
> inside a barrier or the waitall calls.

I hope not. Group A and Group B are disjoint and their union is the group of all processes. Therefore every process is either sending or receiving. The barrier is only used for my timing ...

> Since there are so many processes 
> involved, is it possible that we missed at least one process that was 
> still in the previous phase of the application? I was wondering if we 
> could have each process make a call to barrier at the beginning of this 
> code so that we can know for sure that all the processes have completed 
> executing up to this phase. Please let me know if this is feasible and if 
> you make such a change in the code and re-send it.

In principle it shouldn't be necessary, as there is an allreduce in these methods which I inserted to check that the number of sent messages matches the number of (expected) messages to receive.
However, to be sure, I added the barrier. Please find the patch for this attached (let me know if it works, I'm not used to creating patches ...). In my test, the outcome is the same.
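
To make that check explicit, it amounts to something like the following minimal sketch (not the actual code from those methods; the function name and types are placeholders I chose for illustration):

#include <mpi.h>
#include <cassert>

/* Every process contributes the number of messages it will send and the
 * number it expects to receive; globally the two sums must agree. The
 * leading barrier corresponds to the one added by the attached patch. */
void check_send_recv_counts(int nsends, int nexpected, MPI_Comm comm)
{
    MPI_Barrier(comm);

    long local[2], global[2];
    local[0] = nsends;
    local[1] = nexpected;
    MPI_Allreduce(local, global, 2, MPI_LONG, MPI_SUM, comm);

    /* Total posted sends must equal total expected receives. */
    assert(global[0] == global[1]);
}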


>             And we are also interested in looking at the performance 
> comparisons that you were speaking about.

Perfect. I will take some new measurements with the parameters you sent me and will prepare some graphs ...

Thanks,
Dorian

> 
> Thanks,
> Krishna
> 
> Dorian Krause wrote:
> > Hi Krishna,
> >
> > sorry, I always forget to send to the list in cc ...
> >
> > I have tested the code with open-mpi, varying the eager size limit 
> > (which is 12 kB for the openib btl by default) down to 1 kB. It still 
> > works.
> >
> > Thanks,
> > Dorian
> >
> > Krishna Chaitanya Kandalla wrote:
> >> Dorian,
> >>    On our systems, by tweaking a few parameters, I was able to get 
> >> the application to complete with up to 128 processes. You can probably run 
> >> your application in the following manner and let us know if it works 
> >> for you too.
> >>
> >> mpirun_rsh -np 128 -hostfile ./hosts MV2_IBA_EAGER_THRESHOLD=16384 
> >> MV2_VBUF_TOTAL_SIZE=16384  scale_Trans_AlltoalPt2Pt abcdefg
> >>
> >>   >  How is the nonblocking communication implemented?
> >>
> >>     Non-blocking calls are designed to provide overlap between 
> >> communication and computation. Calls to MPI_Isend and MPI_Irecv 
> >> return without waiting for confirmation from the library that the 
> >> message has actually been sent/received. The applications are 
> >> supposed to do an MPI_Wait later to make sure that the exchange has 
> >> been completed. So, as long as the user does not touch the buffers 
> >> that were used for the Isend and Irecv calls, things should be ok.  
> >> In MVAPICH2, the pt2pt calls use the "eager" protocol for messages of 
> >> size less than about 8K and the rendezvous protocol for larger 
> >> messages. By using the above run-time flags, we can alter the 
> >> threshold between eager and rendezvous messages. It's not clear how 
> >> the application passes when this threshold is set to 16K.  Do you 
> >> have any profiling information regarding the size of message 
> >> exchanged? Also, I noticed a lot of calls to Alltoall being made. It 
> >> will help if you can provide us some information about the size of 
> >> the buffers for the alltoall operations too.
> >>
> >>
> >>
> >> Thanks,
> >> Krishna
> >>
> >>  
> >>           Dorian Krause wrote:
> >>> Hi Krishna,
> >>>
> >>> Krishna Chaitanya Kandalla wrote:
> >>>> Dorian,
> >>>>           Were you able to run your application with open-mpi as well?
> >>>
> >>> Yes, I have no problem running it with open-mpi (version 1.3.2).
> >>>
> >>>>   If it is passing with both mpich2 and open-mpi, it indicates that 
> >>>> the mvapich2 library is doing something wrong.
> >>>
> >>> I don't know how I should interpret the program behavior. As you 
> >>> have pointed out, the crucial question is how the set of neighbors 
> >>> is constructed. You might have seen that I have inserted a small 
> >>> check in the code to verify that the number of sends matches the number of 
> >>> (expected) receives on the other side. This is the case.
> >>> Since the hang occurs with all three methods to construct the 
> >>> neighbor set, either all of them are wrong, or the hang is not 
> >>> directly related to this.
> >>>
> >>> For me it looks like the following:
> >>> Processor 40 sends the envelope to PE 12. PE 12 probes the message 
> >>> and issues a recv. In the meantime however, PE 40 somehow slipped 
> >>> through the MPI_Waitall function and so there is no matching send 
> >>> operation.
> >>>
> >>> Could that be the case? (I'm just speculating.) How is the nonblocking 
> >>> communication implemented?
> >>>
> >>>
> >>> Thanks,
> >>> Dorian
> >>>
> >>>> I tried toggling some of the mvapich2-related parameters, but the 
> >>>> hang doesn't seem to go away.
> >>>>
> >>>> Thanks,
> >>>> Krishna
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Dorian Krause wrote:
> >>>>> Hi Krishna,
> >>>>>
> >>>>> thanks for your tests. If I can be of any help in finding the bug, 
> >>>>> please let me know ...
> >>>>>
> >>>>> Thanks,
> >>>>> Dorian
> >>>>>
> >>>>> Krishna Chaitanya Kandalla wrote:
> >>>>>> Dorian,
> >>>>>>           I am able to reproduce the hang with 96 processes on 
> >>>>>> our systems. I also checked that it runs correctly with 
> >>>>>> MPICH2-1.0.8. We will try to find a fix soon.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Krishna
> >>>>>>
> >>>>>>
> >>>>>> Krishna Chaitanya Kandalla wrote:
> >>>>>>> Dorian,
> >>>>>>>           I have taken a quick look at the set of backtraces. 
> >>>>>>> Is it possible to give us a copy of the application that you are 
> >>>>>>> running?
> >>>>>>>           I noticed that the application is possibly changing 
> >>>>>>> the topology before it gets inside the MPI layer and hangs. I am 
> >>>>>>> also guessing that the code snippet that you provided is related 
> >>>>>>> to what is going on inside hgc::comm::Topology::barrier. But 
> >>>>>>> we don't quite know how the set "all neighbors" has been set up. 
> >>>>>>> If we can run the application on our systems here, it would be 
> >>>>>>> easier to figure out what is going on.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Krishna
> >>>>>>>
> >>>>>>> Dorian Krause wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> again these 96 processors ...
> >>>>>>>>
> >>>>>>>> My application hangs in a communication step which looks like 
> >>>>>>>> this:
> >>>>>>>>
> >>>>>>>> ---------
> >>>>>>>> Group A:
> >>>>>>>>
> >>>>>>>>    for all neighbors {
> >>>>>>>>       MPI_Isend(...);
> >>>>>>>>    }
> >>>>>>>>   MPI_Waitall(...);
> >>>>>>>>
> >>>>>>>>    MPI_Barrier();
> >>>>>>>> ----
> >>>>>>>> Group B:
> >>>>>>>>      while(#messages to receive > 0) {
> >>>>>>>>       MPI_Probe(MPI_ANY_SOURCE, &stat);
> >>>>>>>>       q = stat.MPI_SOURCE
> >>>>>>>>       /* in subfunction: */
> >>>>>>>>       MPI_Probe(q, &stat)
> >>>>>>>>       q = stat.MPI_COUNT;
> >>>>>>>>       MPI_Recv(q, ...);
> >>>>>>>>    }
> >>>>>>>>    MPI_Barrier();
> >>>>>>>> ----
> >>>>>>>>
> >>>>>>>> For 96 or more processes this application hangs. Since I can't 
> >>>>>>>> debug on this scale, I used gdb to get backtraces. It turned 
> >>>>>>>> out that 94 processes are waiting in the barrier, one process 
> >>>>>>>> is trying to receive a message (stuck in MPI_Recv) and one 
> >>>>>>>> other is waiting in MPI_Waitall(...). This looks fine; however, 
> >>>>>>>> the ranks do not match:
> >>>>>>>>
> >>>>>>>> On the PE with rank 83, I have
> >>>>>>>>
> >>>>>>>> #3  0x00000000004349b9 in PMPI_Recv (buf=0x1bd96010, count=202,
> >>>>>>>>    datatype=-1946157051, source=40, tag=374, comm=-1006632954, 
> >>>>>>>> status=0x1)
> >>>>>>>>    at recv.c:156
> >>>>>>>>
> >>>>>>>> and on PE with rank *12* I have
> >>>>>>>>
> >>>>>>>> #3  0x00000000004368f4 in PMPI_Waitall (count=8,
> >>>>>>>>    array_of_requests=0x197e6b10, array_of_statuses=0x1)
> >>>>>>>>    at waitall.c:191
> >>>>>>>>
> >>>>>>>> It seems that rank 40 slipped through the MPI_Waitall 
> >>>>>>>> even though it was not supposed to do so ...
> >>>>>>>>
> >>>>>>>> Please find attached the output files. There are three 
> >>>>>>>> processes which seem not to be in the barrier (two on compute-0-3 
> >>>>>>>> and one on compute-0-13, but the one with the short backtrace on 
> >>>>>>>> compute-0-3 is also in the barrier, as I could confirm by hand).
> >>>>>>>>
> >>>>>>>> Any hints what might cause this error?
> >>>>>>>>
> >>>>>>>> I'm using the trunk version of mvapich2 (check-out yesterday) 
> >>>>>>>> and the cluster consists of 14 LS22 blades (opteron) with 4x 
> >>>>>>>> DDR InfiniBand. I'm not quite sure which OFED version it is (it 
> >>>>>>>> is delivered with the Rocks distribution and they are typically 
> >>>>>>>> not very verbose concerning version numbers ...).
> >>>>>>>>
> >>>>>>>> Thanks for your help,
> >>>>>>>> Dorian
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> 



-------------- next part --------------
--- /home/kraused/Devel/HGC/ParticleSubCommunicationManagerPt2Pt.cc	2009-08-13 17:10:42.000000000 +0200
+++ ParticleSubCommunicationManagerPt2Pt.cc	2009-08-15 00:08:20.000000000 +0200
@@ -16,6 +16,8 @@
 							    int nremote,
 							    const int *sendcount)
 {
+	comm::SET_WORLD.barrier();
+
 	ArrayBase<ParticleSub>::Request *req = 
 			new ArrayBase<ParticleSub>::Request[list.size()];
 
@@ -50,6 +52,8 @@
 							    const int *recvcount,
 							    const ParticleForMeshInfo *info)
 {
+	comm::SET_WORLD.barrier();
+
 	int n = 0;
 	if(info) {
 		n = info->numparticles();

