[mvapich-discuss] Collective MPI failure
Vishwas
vvasisht at locuz.com
Sun Dec 17 23:21:36 EST 2006
Hello,
One of my nodes went down while my job was running (it was a farming
job), and after that node crashed, MPI failed across the whole cluster.
I had previously been told to use -env MV2_DEFAULT_TIME_OUT 12, which I have
done.
This is the error I got again:
send desc error
[25] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=21 at line 410 in file vapi_channel_manager.c
send desc error
send desc error
[26] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=22 at line 410 in file vapi_channel_manager.c
[24] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=20 at line 410 in file vapi_channel_manager.c
send desc error
[27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=8 1, dest rank=20 at line 410 in file vapi_channel_manager.c
rank 191 in job 1 gulabjamun.ncbs.res.in_45275 caused collective abort of all ranks
exit status of rank 191: killed by signal 9
rank 190 in job 1 gulabjamun.ncbs.res.in_45275 caused collective abort of all ranks
exit status of rank 190: killed by signal 9
rank 188 in job 1 gulabjamun.ncbs.res.in_45275 caused collective abort of all ranks
exit status of rank 188: killed by signal 9
rank 26 in job 1 gulabjamun.ncbs.res.in_45275 caused collective abort of all ranks
exit status of rank 26: killed by signal 9
rank 24 in job 1 gulabjamun.ncbs.res.in_45275 caused collective abort of all ranks
exit status of rank 24: killed by signal 9
Regards
Vishwas