[mvapich-discuss] Collective MPI failure

Matthew Koop koop at cse.ohio-state.edu
Wed Dec 20 16:02:45 EST 2006


Vishwas,

It seems likely that you still have a loose cable somewhere. Have you
also made sure that you have the latest firmware for your cards?

You can try increasing MV2_DEFAULT_TIME_OUT to something like 16, but I
don't think this is your issue.
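
For a sense of scale, here is a rough sketch of what the two values would
mean. It assumes that MV2_DEFAULT_TIME_OUT is passed through as the
InfiniBand QP local ACK timeout exponent, where the wait before a retry is
4.096 us * 2^value (with 0 meaning "wait forever"). That mapping is an
assumption on my part, and the function name is made up for illustration.

```python
# Assumption: MV2_DEFAULT_TIME_OUT is used as the InfiniBand local ACK
# timeout exponent, so the wait per retry is 4.096 us * 2**value.
IB_TIMEOUT_BASE_US = 4.096


def ack_timeout_us(exponent: int) -> float:
    """Return the ACK timeout in microseconds for a 5-bit exponent (1..31).

    0 is excluded because it is the special "infinite" value.
    """
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be between 1 and 31")
    return IB_TIMEOUT_BASE_US * (2 ** exponent)


# Compare the value already in use (12) with the suggested one (16).
for value in (12, 16):
    print(f"MV2_DEFAULT_TIME_OUT={value}: "
          f"{ack_timeout_us(value) / 1000:.1f} ms per retry")
```

Under that assumption, going from 12 to 16 makes each retry wait 16 times
longer, which can ride out a brief stall but will not help if the peer
node or the link is actually gone.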

Matt



On Mon, 18 Dec 2006, Vishwas wrote:

> Hello,
>
>
>
> One of my nodes went down while my job was running (this was a farming
> job). But MPI has failed across my whole cluster after this node crash.
>
> I had previously been told to use -env MV2_DEFAULT_TIME_OUT 12, which I have
> done.
>
>
>
> The following is the error I got again:
>
>
>
> send desc error
> [25] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=21  at line 410 in file vapi_channel_manager.c
> send desc error
> send desc error
> [26] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=22  at line 410 in file vapi_channel_manager.c
> [24] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=20  at line 410 in file vapi_channel_manager.c
> send desc error
> [27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=20  at line 410 in file vapi_channel_manager.c
> rank 191 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
>   exit status of rank 191: killed by signal 9
> rank 190 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
>   exit status of rank 190: killed by signal 9
> rank 188 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
>   exit status of rank 188: killed by signal 9
> rank 26 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
>   exit status of rank 26: killed by signal 9
> rank 24 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
>   exit status of rank 24: killed by signal 9
>
>
>
>
>
> Regards
>
> Vishwas
>
>


