[mvapich-discuss] job aborted after a few days run

Vishwas vvasisht at locuz.com
Fri Dec 1 00:07:32 EST 2006


Hello,

 

I had reported that my job was killed after a few days of running, with the
following error.

 

[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21

 at line 410 in file vapi_channel_manager.c

send desc error

[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23

 at line 410 in file vapi_channel_manager.c

rank 138 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
all ranks

  exit status of rank 138: killed by signal 9 

rank 131 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
all ranks

  exit status of rank 131: killed by signal 9 

rank 86 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
all ranks

  exit status of rank 86: killed by signal 9 

~/ROBUST/nov15_2006_3x7 

~/ROBUST/nov15_2006_3x7 

send desc error

 

I found that someone else had reported a similar error, and the suggested
solution was as follows.

 

Try to increase the VAPI driver timeout parameter, VIADEV_DEFAULT_TIME_OUT,
for the MPI stack. To achieve this, use the '-paramfile filename' option
with mpirun_rsh. For example, you can run:

    /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -np 2 \
        -paramfile ./perfparams -hostfile /root/cluster \
        /usr/local/ibgd/mpi/osu/gcc/tests/PMB2.2.1/PMB-MPI1

where the file perfparams includes the following line:

    VIADEV_DEFAULT_TIME_OUT = 12
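
For a long-running job it may help to generate the parameter file and the launch command from one place, so the timeout value is recorded with the run. The following is only a minimal sketch of the quoted advice, not an official MVAPICH tool: the helper names write_paramfile and build_command are hypothetical, the paths are the ones from the quoted example, and the value 12 is simply the one suggested there. The command is assembled and printed, not executed.

```python
import shlex
import tempfile
from pathlib import Path


def write_paramfile(path, timeout):
    """Write a one-line parameter file in the 'NAME = value' form quoted above."""
    path = Path(path)
    path.write_text(f"VIADEV_DEFAULT_TIME_OUT = {timeout}\n")
    return path


def build_command(mpirun, nprocs, paramfile, hostfile, program):
    """Assemble the mpirun_rsh argument list; nothing is executed here."""
    return [
        mpirun,
        "-np", str(nprocs),
        "-paramfile", str(paramfile),
        "-hostfile", str(hostfile),
        program,
    ]


# A scratch directory is used so the sketch runs anywhere; in practice the
# file would normally sit next to the job script.
workdir = Path(tempfile.mkdtemp())
paramfile = write_paramfile(workdir / "perfparams", timeout=12)

cmd = build_command(
    "/usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh",
    2,
    paramfile,
    "/root/cluster",
    "/usr/local/ibgd/mpi/osu/gcc/tests/PMB2.2.1/PMB-MPI1",
)

print(paramfile.read_text(), end="")
print(shlex.join(cmd))
```

Keeping the timeout as a function argument makes it easy to try a different value on a later run without editing the file by hand.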

 

I want to know whether the same applies to my problem. Please help me out,
since if this happens again, I would lose many days.

 

Regards

Vishwas

   _____  

From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
Sent: Thursday, November 30, 2006 5:53 PM
To: 'Axel Rimanek'
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: RE: [mvapich-discuss] job aborted after a few days run

 

Hello Axel,

 

No, I have not used nonblocking communication, only MPI_Send and MPI_Recv,
i.e., blocking calls.

 

Regards

Vishwas

 

   _____  

From: Axel Rimanek [mailto:Axel at Rimanek.de] 
Sent: Thursday, November 30, 2006 3:28 PM
To: 'Vishwas'
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: RE: [mvapich-discuss] job aborted after a few days run

 

Hello Vishwas,

did you also use nonblocking communications?

 

Axel

 

   _____  

From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
Sent: Thursday, November 30, 2006 6:46 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] job aborted after a few days run

 

Hello,

 

I was running a farming job on my cluster. After a few days of running, the
job aborted abruptly. The following error was written to the log file.

 

[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21

 at line 410 in file vapi_channel_manager.c

send desc error

[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23

 at line 410 in file vapi_channel_manager.c

rank 138 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
all ranks

  exit status of rank 138: killed by signal 9 

rank 131 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
all ranks

  exit status of rank 131: killed by signal 9 

rank 86 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
all ranks

  exit status of rank 86: killed by signal 9 

~/ROBUST/nov15_2006_3x7 

~/ROBUST/nov15_2006_3x7 

send desc error

[76] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=20

 at line 410 in file vapi_channel_manager.c

send desc error

[61] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21

 at line 410 in file vapi_channel_manager.c

rank 76 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
all ranks

  exit status of rank 76: killed by signal 9 

rank 61 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
all ranks

  exit status of rank 61: killed by signal 9 

~/ROBUST/nov15_2006_3x7 

send desc error

[52] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23

 at line 410 in file vapi_channel_manager.c

send desc error

[39] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23

 at line 410 in file vapi_channel_manager.c

send desc error

[27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21

 at line 410 in file vapi_channel_manager.c
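
When reading a log like the one above, it can help to tally which destination ranks the failed sends were aimed at. The following is a rough sketch only: the regular expression is an assumption about the message layout, based purely on the lines shown here (including the mail client's line wrap before "code=81"), and the sample text contains just the Abort lines copied from this log. Whether a concentration on a few destination ranks points at particular nodes or links is a guess, not something established in this thread.

```python
import re
from collections import Counter

# Abort lines copied from the log above, wrapped as they appear in the mail.
LOG = """\
[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23
[76] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=20
[61] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
[52] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23
[39] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23
[27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
"""

# Assumed layout: "[<rank>] Abort: ... code=<NAME>, ... dest rank=<rank>".
# DOTALL lets the lazy wildcards cross the wrapped line break.
ABORT = re.compile(
    r"\[(\d+)\] Abort:.*?code=(\w+),.*?dest rank=(\d+)",
    re.DOTALL,
)

failures = [
    (int(src), code, int(dst))
    for src, code, dst in ABORT.findall(LOG)
]

# Count how many failed sends targeted each destination rank.
by_dest = Counter(dst for _, _, dst in failures)

for dst, count in sorted(by_dest.items()):
    senders = sorted(src for src, _, d in failures if d == dst)
    print(f"dest rank {dst}: {count} failed send(s) from ranks {senders}")
```

To run it on a real log, replace LOG with the contents of the job's output file.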

 

Regards

Vishwas


--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.5.430 / Virus Database: 268.14.19/555 - Release Date: 11/27/2006
6:09 PM


