[mvapich-discuss] Collective MPI failure

Vishwas vvasisht at locuz.com
Sun Dec 17 23:21:36 EST 2006


Hello,

One of my nodes went down while my job was running (it was a farming
job). After this node crashed, the whole MPI run on my cluster failed.

I had previously been told to use -env MV2_DEFAULT_TIME_OUT 12, and I have
done so.

I got the following error again:

send desc error
[25] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=21  at line 410 in file vapi_channel_manager.c
send desc error
send desc error
[26] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=22  at line 410 in file vapi_channel_manager.c
[24] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=20  at line 410 in file vapi_channel_manager.c
send desc error
[27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=20  at line 410 in file vapi_channel_manager.c

rank 191 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
  exit status of rank 191: killed by signal 9
rank 190 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
  exit status of rank 190: killed by signal 9
rank 188 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
  exit status of rank 188: killed by signal 9
rank 26 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
  exit status of rank 26: killed by signal 9
rank 24 in job 1  gulabjamun.ncbs.res.in_45275   caused collective abort of all ranks
  exit status of rank 24: killed by signal 9
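
In case it helps with diagnosis, the abort lines in this log can be reduced to a list of the destination ranks that could not be reached. The following is only a rough sketch in Python, assuming the message format shown in this log. The names ABORT_RE, summarize_aborts and unreachable_ranks are illustrative and are not part of MVAPICH, and SAMPLE is copied from the abort messages in this post.

```python
import re
from collections import Counter

# Matches abort messages of the form seen in this log, e.g.
#   [25] Abort: [] Got completion with error, code=..., vendor code=..., dest rank=21
# DOTALL lets the pattern span messages that were wrapped over several lines.
ABORT_RE = re.compile(
    r"\[(\d+)\] Abort:.*?code=(\w+),.*?dest rank=(\d+)",
    re.DOTALL,
)

# Abort messages copied from this post (one message per line).
SAMPLE = """\
send desc error
[25] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=21  at line 410 in file vapi_channel_manager.c
send desc error
send desc error
[26] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=22  at line 410 in file vapi_channel_manager.c
[24] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=20  at line 410 in file vapi_channel_manager.c
send desc error
[27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=20  at line 410 in file vapi_channel_manager.c
"""


def summarize_aborts(log_text):
    """Return (source_rank, error_code, dest_rank) for each abort message."""
    return [(int(src), code, int(dst))
            for src, code, dst in ABORT_RE.findall(log_text)]


def unreachable_ranks(log_text):
    """Count how often each destination rank appears in an abort message."""
    return Counter(dst for _, _, dst in summarize_aborts(log_text))


if __name__ == "__main__":
    for dst, count in sorted(unreachable_ranks(SAMPLE).items()):
        print(f"dest rank {dst}: {count} abort message(s)")
```

For the messages in this post, the sketch reports destination ranks 20, 21 and 22, with rank 20 named twice. Those are the ranks whose host would be worth checking first.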

 

 

Regards,
Vishwas
