[mvapich-discuss] Re: VAPI_PROTOCOL_R3' failed
wei huang
huanwei at cse.ohio-state.edu
Wed May 28 15:50:36 EDT 2008
Hi Sangamesh,
Would you please let us know more information so that we can look further
into this issue?
*) Is mvapich2-1.0.2 being used here?
*) Are you using the default build scripts and default environment
variables?
*) Does your application use threads at any stage?
*) Would you please apply the attached patch, which prints some
information about the assertion failure?
*) Would you please try setting each of the following environment
variables (one at a time) during your run to see whether any of them helps?
-env MV2_USE_RDMA_FAST_PATH 0
-env MV2_USE_SRQ 0
-env MV2_USE_COALESCE 0
-env MV2_USE_SHM_COLL 0
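
As a rough sketch of trying the settings one at a time, the loop below only prints one mpiexec command per variable, so nothing is launched. The process count (4) and the executable name ./your_app are placeholders, not values from the report; the -env NAME VALUE form is taken from the lines above.

```shell
#!/bin/sh
# Dry run: print one mpiexec command per setting so that each variable
# is disabled in its own run. NPROCS and APP are placeholders (assumptions).
NPROCS=4
APP=./your_app
VARS="MV2_USE_RDMA_FAST_PATH MV2_USE_SRQ MV2_USE_COALESCE MV2_USE_SHM_COLL"

build_cmd() {
    # $1 is the MV2_* variable to set to 0 for this run
    printf 'mpiexec -n %s -env %s 0 %s\n' "$NPROCS" "$1" "$APP"
}

for var in $VARS; do
    build_cmd "$var"
done
```

Once the printed commands look right for your site, run them one by one and note which setting, if any, makes the assertion or the hang go away.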
Thanks
Regards,
Wei Huang
774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501
On Wed, 28 May 2008, Dhabaleswar Panda wrote:
> We are taking a look at this error and will get back to you soon.
>
> DK
>
> On Wed, 28 May 2008, Sangamesh B wrote:
>
> > There was no reply from the list.
> >
> > Here is some more information about the DLPOLY + mvapich2 + ofed-1.2.5.5 +
> > Mellanox HCA job, which hangs after a certain number of iterations.
> >
> > The same job with mpich2 + ethernet runs without any problems and
> > produces the final result.
> >
> > With mvapich2, the job runs for some iterations and then stops computing. It
> > reports no error at this point, but the output file, which is updated
> > at each iteration, shows no further progress.
> >
> > Also, I submitted the same mvapich2 job repeatedly. Each time
> > it stops at the same iteration.
> >
> > Do any mvapich2 variables need to be set?
> >
> > Thanks,
> > Sangamesh
> >
> > On Tue, May 27, 2008 at 4:30 PM, Sangamesh B <forum.san at gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > A DL-POLY job on a 5-node InfiniBand cluster gave the
> > > following error:
> > >
> > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > rank 1 in job 20 compute-0-12.local_32785 caused collective abort of all
> > > ranks
> > > exit status of rank 1: killed by signal 9
> > >
> > > The job runs for 20-30 minutes and then gives the above error.
> > >
> > > This is with mvapich2 + ofed-1.2.5.5 + Mellanox HCAs.
> > >
> > > Any idea what might be wrong?
> > >
> > > Thanks,
> > > Sangamesh
> > >
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
-------------- next part --------------
Index: src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c
===================================================================
--- src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c (revision 2532)
+++ src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c (working copy)
@@ -223,6 +223,15 @@
     case VAPI_PROTOCOL_R3:
         rndv = (cts_pkt == NULL) ? NULL : &cts_pkt->rndv;
         sreq->mrail.partner_id = cts_pkt->receiver_req_id;
+        if (rndv->protocol != VAPI_PROTOCOL_R3) {
+            fprintf(stderr, "Unexpected protocol type %d\n", rndv->protocol);
+            fprintf(stderr, "Rndv buf %p (alloc %d), size %d, offset %d, "
+                    "r_addr %p, d_entry %p\n",
+                    req->mrail.rndv_buf, req->mrail.rndv_buf_alloc,
+                    req->mrail.rndv_buf_sz, req->mrail.rndv_buf_off,
+                    req->mrail.remote_addr, req->mrail.d_entry);
+
+        }
         assert(rndv->protocol == VAPI_PROTOCOL_R3);
         break;
     case VAPI_PROTOCOL_RGET:
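
The diff header above uses paths relative to the top of the source tree (src/mpid/...), so a patch of this shape is normally applied from that directory with -p0. The following is only a self-contained sketch of that workflow in a scratch directory; the file names src/demo.c and debug.patch are made up for the demo and are not part of mvapich2.

```shell
#!/bin/sh
set -e
# Scratch tree standing in for a source checkout (hypothetical names).
work=$(mktemp -d)
cd "$work"
mkdir -p src
printf 'line one\nline two\n' > src/demo.c

# A tiny unified diff whose paths are relative to the tree root,
# in the same style as the "Index: src/..." header of the attachment.
cat > debug.patch <<'EOF'
--- src/demo.c
+++ src/demo.c
@@ -1,2 +1,3 @@
 line one
+inserted line
 line two
EOF

# -p0 keeps the full path from the diff header.
# --dry-run reports whether the hunks apply without touching any file.
patch -p0 --dry-run < debug.patch
patch -p0 < debug.patch
```

For the real attachment, the same two patch commands would be run from the top of the mvapich2 source tree against the saved patch file, followed by a rebuild so that the extra fprintf output appears in the next run.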