[mvapich-discuss] Re: VAPI_PROTOCOL_R3' failed

wei huang huanwei at cse.ohio-state.edu
Wed May 28 15:50:36 EDT 2008


Hi Sangamesh,

Would you please let us know more information so that we can look further
into this issue?

*) Is mvapich2-1.0.2 being used here?

*) Are you using the default compile scripts and the default environment
variables?

*) Does your application use threads at any stage?

*) Would you please apply the attached patch, which prints out some
additional information about the assertion failure?

*) Would you please try setting each of the following environment variables
(one at a time) during your run to see if any of them helps? A sample
command line is shown after the list.

-env MV2_USE_RDMA_FAST_PATH 0
-env MV2_USE_SRQ 0
-env MV2_USE_COALESCE 0
-env MV2_USE_SHM_COLL 0
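
As a rough sketch, a run with one of these variables set might look like the
following with the MPD-based mpiexec shipped with mvapich2-1.0.x; the process
count and executable name are placeholders, not taken from your job:

  mpiexec -n 40 -env MV2_USE_RDMA_FAST_PATH 0 ./dlpoly_job

Repeat the run with each variable in turn (MV2_USE_SRQ, MV2_USE_COALESCE,
MV2_USE_SHM_COLL) set to 0, leaving the others at their defaults.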

Thanks

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Wed, 28 May 2008, Dhabaleswar Panda wrote:

> We are taking a look at this error and will get back to you soon.
>
> DK
>
> On Wed, 28 May 2008, Sangamesh B wrote:
>
> > There was no reply from the list.
> >
> > Following is some more info about the DLPOLY + mvapich2 + ofed-1.2.5.5 +
> > Mellanox HCA job, which hangs after a certain number of iterations.
> >
> > The same job with mpich2 + ethernet runs fine without any problems and
> > produces the final result.
> >
> > With mvapich2, the job runs for some iterations and then stops calculating.
> > It doesn't give any error at this point, but the output file, which is
> > updated at each iteration, stops showing progress.
> >
> > One more point: I repeatedly submitted the same mvapich2 job, and in each
> > case it stops at the same iteration.
> >
> > Do any mvapich2 variables have to be set?
> >
> > Thanks,
> > Sangamesh
> >
> > On Tue, May 27, 2008 at 4:30 PM, Sangamesh B <forum.san at gmail.com> wrote:
> >
> > > Hi All,
> > >
> > >    A DL-POLY application job on a 5-node InfiniBand cluster gave the
> > > following error:
> > >
> > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > rank 1 in job 20  compute-0-12.local_32785   caused collective abort of all
> > > ranks
> > >   exit status of rank 1: killed by signal 9
> > >
> > > The job runs for 20-30 minutes and then gives the above error.
> > >
> > > This is with mvapich2 + ofed-1.2.5.5 + Mellanox HCA's.
> > >
> > > Any idea what might be wrong?
> > >
> > > Thanks,
> > > Sangamesh
> > >
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
-------------- next part --------------
Index: src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c
===================================================================
--- src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c	(revision 2532)
+++ src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c	(working copy)
@@ -223,6 +223,15 @@
         case VAPI_PROTOCOL_R3:
             rndv = (cts_pkt == NULL) ? NULL : &cts_pkt->rndv;
             sreq->mrail.partner_id = cts_pkt->receiver_req_id;
+            if (rndv->protocol != VAPI_PROTOCOL_R3) {
+                fprintf(stderr, "Unexpected protocol type %d\n", rndv->protocol);
+                fprintf(stderr, "Rndv buf %p (alloc %d), size %d, offset %d, "
+                        "r_addr %p, d_entry %p\n", 
+                        req->mrail.rndv_buf, req->mrail.rndv_buf_alloc,
+                        req->mrail.rndv_buf_sz, req->mrail.rndv_buf_off, 
+                        req->mrail.remote_addr, req->mrail.d_entry);
+                        
+            }
             assert(rndv->protocol == VAPI_PROTOCOL_R3);
             break;
         case VAPI_PROTOCOL_RGET:

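A rough sketch of how to apply the attached patch and rebuild, assuming the
paths above are relative to the top of the mvapich2 source tree (so -p0 is
the right strip level) and that the attachment is saved as rndv_debug.patch
(the file name is just a placeholder):

  cd mvapich2-1.0.2
  patch -p0 < rndv_debug.patch
  make && make install

After rebuilding, rerun the job; the lines printed just before the assertion
fires are the information we are after.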
