[mvapich-discuss] Re: VAPI_PROTOCOL_R3' failed

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed May 28 07:13:23 EDT 2008


We are taking a look at this error and will get back to you soon.

DK

On Wed, 28 May 2008, Sangamesh B wrote:

> There was no reply from the list.
>
> Here is some more information about the DLPOLY + mvapich2 + ofed-1.2.5.5 +
> Mellanox HCA job, which hangs after a certain number of iterations.
>
> The same job with mpich2 + Ethernet runs without any problems and
> produces the final result.
>
> With mvapich2, the job runs for some iterations and then stops calculating.
> It doesn't report any error at this point, but the output file, which is
> updated at each iteration, stops showing progress.
>
> One more point: I submitted the same mvapich2 job repeatedly, and each
> time it stops at the same iteration.
>
> Do any mvapich2 variables have to be set?
>
> Thanks,
> Sangamesh
>
> On Tue, May 27, 2008 at 4:30 PM, Sangamesh B <forum.san at gmail.com> wrote:
>
> > Hi All,
> >
> >    A DL-POLY application job on a 5-node InfiniBand cluster gave the
> > following error:
> >
> > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > rank 1 in job 20  compute-0-12.local_32785   caused collective abort of all
> > ranks
> >   exit status of rank 1: killed by signal 9
> >
> > The job runs for 20-30 minutes and then gives the above error.
> >
> > This is with mvapich2 + ofed-1.2.5.5 + Mellanox HCAs.
> >
> > Any idea what might be wrong?
> >
> > Thanks,
> > Sangamesh
> >
>


