[mvapich-discuss] job aborted after a few days run

Vishwas vvasisht at locuz.com
Thu Nov 30 07:22:32 EST 2006


Hello Axel,

 

No, I have not used nonblocking communication, only MPI_Send and MPI_Recv,
i.e., blocking calls.

 

Regards

Vishwas

 

   _____  

From: Axel Rimanek [mailto:Axel at Rimanek.de] 
Sent: Thursday, November 30, 2006 3:28 PM
To: 'Vishwas'
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: RE: [mvapich-discuss] job aborted after a few days run

 

Hello Vishwas,

Did you also use nonblocking communication?

 

Axel

 

   _____  

From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
Sent: Thursday, November 30, 2006 6:46 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] job aborted after a few days run

 

Hello,

 

I was running a farming job on my cluster. After a few days of running, the
job aborted abruptly. The following errors were written to the log file.

 

[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=21
 at line 410 in file vapi_channel_manager.c
send desc error
[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=23
 at line 410 in file vapi_channel_manager.c
rank 138 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of all ranks
  exit status of rank 138: killed by signal 9
rank 131 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of all ranks
  exit status of rank 131: killed by signal 9
rank 86 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of all ranks
  exit status of rank 86: killed by signal 9
~/ROBUST/nov15_2006_3x7
~/ROBUST/nov15_2006_3x7
send desc error
[76] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=20
 at line 410 in file vapi_channel_manager.c
send desc error
[61] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=21
 at line 410 in file vapi_channel_manager.c
rank 76 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of all ranks
  exit status of rank 76: killed by signal 9
rank 61 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of all ranks
  exit status of rank 61: killed by signal 9
~/ROBUST/nov15_2006_3x7
send desc error
[52] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=23
 at line 410 in file vapi_channel_manager.c
send desc error
[39] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=23
 at line 410 in file vapi_channel_manager.c
send desc error
[27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=21
 at line 410 in file vapi_channel_manager.c
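
In the log above, every failed send targets destination rank 20, 21 or 23,
so it may help to tally the abort records by destination rank before
checking the nodes that host those ranks. The following is only a minimal
sketch, not part of MVAPICH: the function name tally_dest_ranks and the
sample text are made up for illustration, and the pattern assumes the abort
lines look exactly like the ones in this message (it tolerates the line wrap
between "vendor" and "code=81").

```python
import re
from collections import Counter

# One abort record as it appears in this thread's log. \s+ lets the pattern
# match whether or not the mail client wrapped the line after "vendor".
ABORT_RE = re.compile(
    r"\[(\d+)\] Abort: \[\] Got completion with error,\s+"
    r"code=(\w+),\s+vendor\s+code=(\d+),\s+dest rank=(\d+)"
)


def tally_dest_ranks(log_text):
    """Return a Counter mapping destination rank -> number of abort records."""
    return Counter(int(m.group(4)) for m in ABORT_RE.finditer(log_text))


# Hypothetical sample: two records copied from the log in this message,
# one wrapped and one on a single line.
sample = """[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
 at line 410 in file vapi_channel_manager.c
send desc error
[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81, dest rank=23
 at line 410 in file vapi_channel_manager.c
"""

if __name__ == "__main__":
    for dest, count in sorted(tally_dest_ranks(sample).items()):
        print(f"dest rank {dest}: {count} abort record(s)")
```

To run it on a real log, read the file into a string and pass it to
tally_dest_ranks in place of the sample text.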

 

Regards

Vishwas


--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.5.430 / Virus Database: 268.14.19/555 - Release Date: 11/27/2006
6:09 PM


