[mvapich-discuss] job aborted after a few days run
Vishwas
vvasisht at locuz.com
Fri Dec 1 00:07:32 EST 2006
Hello,
I reported earlier that my job was killed after a few days of running, with
the following error.
[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=21
at line 410 in file vapi_channel_manager.c
send desc error
[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=23
at line 410 in file vapi_channel_manager.c
rank 138 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of all ranks
exit status of rank 138: killed by signal 9
rank 131 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of all ranks
exit status of rank 131: killed by signal 9
rank 86 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of all ranks
exit status of rank 86: killed by signal 9
~/ROBUST/nov15_2006_3x7
~/ROBUST/nov15_2006_3x7
send desc error
I found that someone else had reported a similar error, and the suggested
solution was as follows.
Try to increase the VAPI driver timeout parameter, VIADEV_DEFAULT_TIME_OUT,
for the MPI stack. To achieve this, use the '-paramfile filename' option with
mpirun_rsh. For example, you can run:
/usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -np 2 \
    -paramfile ./perfparams -hostfile /root/cluster \
    /usr/local/ibgd/mpi/osu/gcc/tests/PMB2.2.1/PMB-MPI1
where the file perfparams includes the following line:
VIADEV_DEFAULT_TIME_OUT = 12
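As a minimal sketch of the step quoted above, the parameter file could be created from the shell before launching the job. The file name perfparams and the value 12 come from the quoted advice; the write_paramfile helper and its two arguments are hypothetical and only illustrate the idea.

```shell
#!/bin/sh
# Hypothetical helper (not part of MVAPICH): write a one-line parameter file.
#   $1 = path of the parameter file to create
#   $2 = value for VIADEV_DEFAULT_TIME_OUT (12 is the value from the quoted advice)
write_paramfile() {
    printf 'VIADEV_DEFAULT_TIME_OUT = %s\n' "$2" > "$1"
}

# Create ./perfparams and show its contents for a quick visual check.
write_paramfile ./perfparams 12
cat ./perfparams
```

The resulting file would then be passed to mpirun_rsh with the '-paramfile ./perfparams' option, as in the quoted command. Whether this timeout value is appropriate for a given cluster is not established in this thread.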
I would like to know whether the same applies to my problem. Please help me
out, since if this happens again I would lose many days.
Regards
Vishwas
_____
From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
Sent: Thursday, November 30, 2006 5:53 PM
To: 'Axel Rimanek'
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: RE: [mvapich-discuss] job aborted after a few days run
Hello Axel,
No, I have not used nonblocking communication, only MPI_Send and MPI_Recv,
i.e., blocking calls.
Regards
Vishwas
_____
From: Axel Rimanek [mailto:Axel at Rimanek.de]
Sent: Thursday, November 30, 2006 3:28 PM
To: 'Vishwas'
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: RE: [mvapich-discuss] job aborted after a few days run
Hello Vishwas,
Did you also use nonblocking communication?
Axel
_____
From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
Sent: Thursday, November 30, 2006 6:46 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] job aborted after a few days run
Hello,
I was running a farming job on my cluster. After a few days, the job aborted
abruptly. The following errors were written to the log file.
[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=21
at line 410 in file vapi_channel_manager.c
send desc error
[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=23
at line 410 in file vapi_channel_manager.c
rank 138 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of all ranks
exit status of rank 138: killed by signal 9
rank 131 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of all ranks
exit status of rank 131: killed by signal 9
rank 86 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of all ranks
exit status of rank 86: killed by signal 9
~/ROBUST/nov15_2006_3x7
~/ROBUST/nov15_2006_3x7
send desc error
[76] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=20
at line 410 in file vapi_channel_manager.c
send desc error
[61] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=21
at line 410 in file vapi_channel_manager.c
rank 76 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of all ranks
exit status of rank 76: killed by signal 9
rank 61 in job 9 gulabjamun.ncbs.res.in_34137 caused collective abort of all ranks
exit status of rank 61: killed by signal 9
~/ROBUST/nov15_2006_3x7
send desc error
[52] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=23
at line 410 in file vapi_channel_manager.c
send desc error
[39] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=23
at line 410 in file vapi_channel_manager.c
send desc error
[27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=21
at line 410 in file vapi_channel_manager.c
Regards
Vishwas