[mvapich-discuss] job aborted after a few days run
Vishwas
vvasisht at locuz.com
Thu Nov 30 07:22:32 EST 2006
Hello Axel,
No, I have not used nonblocking communication. I use only MPI_Send and
MPI_Recv, i.e., blocking calls.
Regards
Vishwas
_____
From: Axel Rimanek [mailto:Axel at Rimanek.de]
Sent: Thursday, November 30, 2006 3:28 PM
To: 'Vishwas'
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] job aborted after a few days run
Hello Vishwas,
Did you also use nonblocking communication?
Axel
_____
From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
Sent: Thursday, November 30, 2006 06:46
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] job aborted after a few days run
Hello,
I was running a farming job on my cluster. After a few days of running, the
job aborted abruptly. The following errors were written to the log file.
[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
at line 410 in file vapi_channel_manager.c
send desc error
[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23
at line 410 in file vapi_channel_manager.c
rank 138 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of
all ranks
exit status of rank 138: killed by signal 9
rank 131 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of
all ranks
exit status of rank 131: killed by signal 9
rank 86 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of
all ranks
exit status of rank 86: killed by signal 9
~/ROBUST/nov15_2006_3x7
~/ROBUST/nov15_2006_3x7
send desc error
[76] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=20
at line 410 in file vapi_channel_manager.c
send desc error
[61] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
at line 410 in file vapi_channel_manager.c
rank 76 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of
all ranks
exit status of rank 76: killed by signal 9
rank 61 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of
all ranks
exit status of rank 61: killed by signal 9
~/ROBUST/nov15_2006_3x7
send desc error
[52] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23
at line 410 in file vapi_channel_manager.c
send desc error
[39] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23
at line 410 in file vapi_channel_manager.c
send desc error
[27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
at line 410 in file vapi_channel_manager.c
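
To see whether the failures concentrate on particular destination ranks, the
abort records in a log like the one above can be tallied. The following is a
minimal sketch, not part of MVAPICH: the function name tally_dest_ranks and
the regular expression are illustrative assumptions, and it assumes each
abort record has the wrapped two-line form shown in this log.

```python
import re
from collections import Counter

# Matches one abort record. The log wraps "vendor" and "code=81" onto
# separate lines, so whitespace (including newlines) is allowed between fields.
# Groups: sending rank, error code name, vendor code, destination rank.
ABORT_RE = re.compile(
    r"\[(\d+)\]\s+Abort:.*?code=(\w+),\s*vendor\s+code=(\d+),\s*dest\s+rank=(\d+)",
    re.DOTALL,
)


def tally_dest_ranks(log_text):
    """Count how many abort records name each destination rank."""
    return Counter(int(m.group(4)) for m in ABORT_RE.finditer(log_text))


# Two records copied from the log above, used as sample input.
sample = """\
[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
at line 410 in file vapi_channel_manager.c
send desc error
[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23
at line 410 in file vapi_channel_manager.c
"""

if __name__ == "__main__":
    for rank, count in tally_dest_ranks(sample).most_common():
        print(f"dest rank {rank}: {count} abort(s)")
```

In the full log above, the seven abort records name destination rank 21
three times, rank 23 three times, and rank 20 once. That may point at the
nodes hosting those ranks, but it is only a hint and not a diagnosis.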
Regards
Vishwas
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.5.430 / Virus Database: 268.14.19/555 - Release Date: 11/27/2006
6:09 PM