[mvapich-discuss] Re: VAPI_PROTOCOL_R3' failed

Sangamesh B forum.san at gmail.com
Wed May 28 03:00:39 EDT 2008


There was no reply from the list.

Following is some more info about the DLPOLY + mvapich2 + ofed-1.2.5.5 +
Mellanox HCA job, which hangs after a certain number of iterations.

The same job with mpich2 over ethernet runs fine without any problems, and
produces the final result as well.

With mvapich2, the job runs for some iterations and then stops calculating.
It gives no error at this point, but the output file, which is updated at
each iteration, stops showing progress.

One more point: I submitted the same mvapich2 job repeatedly, and in each
case it stops at the same iteration.
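Since the hang is reproducible, one way to see where the ranks are stuck is to attach a debugger to a hung process on a compute node and dump a backtrace. A minimal sketch (assuming gdb is available on the nodes; `DLPOLY.X` is a placeholder for the actual executable name):

```shell
# Find the hung MPI process on this node (binary name is an assumption;
# substitute your actual DL-POLY executable name)
pid=$(pgrep -n DLPOLY.X)

# Attach gdb non-interactively and print a backtrace for every thread,
# then detach without disturbing the job
gdb --batch -p "$pid" -ex "thread apply all bt"
```

Comparing the backtraces from a few ranks usually shows which MPI call they are all blocked in.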

Do any mvapich2 environment variables need to be set?
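For reference, these are the standard MVAPICH2 runtime parameters I would experiment with first; whether they help with this particular hang is only a guess, and the process count and binary name below are placeholders:

```shell
# Force the R3 (send/recv based) rendezvous protocol instead of the
# RDMA-based ones -- the failing assertion expects VAPI_PROTOCOL_R3
export MV2_RNDV_PROTOCOL=R3

# While isolating the problem, optionally disable the shared receive
# queue path
export MV2_USE_SRQ=0

# Then launch as usual, e.g.:
# mpiexec -n 20 ./DLPOLY.X
```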

Thanks,
Sangamesh

On Tue, May 27, 2008 at 4:30 PM, Sangamesh B <forum.san at gmail.com> wrote:

> Hi All,
>
>    A DL-POLY application job on a 5-node Infiniband setup cluster gave
> following error:
>
> infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> rank 1 in job 20  compute-0-12.local_32785   caused collective abort of all
> ranks
>   exit status of rank 1: killed by signal 9
>
> The job runs for 20-30 minutes and then gives the above error.
>
> This is with mvapich2 + ofed-1.2.5.5 + Mellanox HCA's.
>
> Any idea what might be wrong?
>
> Thanks,
> Sangamesh
>

