[mvapich-discuss] job aborted after a few days run

wei huang huanwei at cse.ohio-state.edu
Fri Dec 1 00:23:35 EST 2006


Hi Vishwas,

You can try setting this parameter with:

mpiexec -n N -env MV2_DEFAULT_TIME_OUT 12 ./a.out
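
For a sense of scale, here is a minimal sketch of what a value of 12 would mean. It assumes MV2_DEFAULT_TIME_OUT is passed through as the InfiniBand local ACK timeout exponent, which this thread does not confirm. Under that encoding the per-attempt timeout is 4.096 microseconds times 2 to the power of the value, and 0 means no timeout. The function name is hypothetical.

```python
def ack_timeout_seconds(exponent: int) -> float:
    """Per-attempt ACK timeout for an InfiniBand timeout exponent.

    Assumption: standard encoding, 4.096 us * 2**exponent, with 0
    meaning "wait forever". The field is 5 bits wide, so 0..31.
    """
    if not 0 <= exponent <= 31:
        raise ValueError("timeout exponent must be in 0..31")
    if exponent == 0:
        return float("inf")
    return 4.096e-6 * (2 ** exponent)


if __name__ == "__main__":
    # Compare a few candidate settings, including the suggested 12.
    for value in (10, 12, 14):
        ms = ack_timeout_seconds(value) * 1e3
        print(f"exponent {value}: {ms:.3f} ms per attempt")
```

Each step up doubles the wait, so raising the value gives a slow or congested link more time before the retry count is exhausted, but it cannot help if the peer process is already gone.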

VAPI_RETRY_EXC_ERR may also happen if one of your processes quits for
some reason (e.g., a segfault). If possible, could you please pay specific
attention to which process is the first one to fail, and get its core
dump? With the current information it is hard to locate the problem you
have.
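
One rough way to narrow this down from the log alone is to tally the "dest rank" field of the abort lines, since a destination that several senders fail to reach is a natural first suspect. The following is a minimal sketch, assuming the raw log has the format quoted below (without the mail quoting prefixes). The function name and the regular expression are hypothetical helpers, not part of MVAPICH.

```python
import re
from collections import Counter

# Matches abort records such as:
#   [138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR,
#   vendor code=81, dest rank=21
# DOTALL and \s allow the record to be wrapped across lines.
ABORT_RE = re.compile(
    r"\[(\d+)\]\s+Abort:.*?code=(\w+),\s*vendor\s+code=(\d+),\s*dest\s+rank=(\d+)",
    re.DOTALL,
)


def tally_dest_ranks(log_text: str) -> Counter:
    """Count how often each destination rank appears in abort records."""
    counts = Counter()
    for _src, _code, _vendor, dest in ABORT_RE.findall(log_text):
        counts[int(dest)] += 1
    return counts


if __name__ == "__main__":
    sample = """[138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=21
 at line 410 in file vapi_channel_manager.c
send desc error
[131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
code=81, dest rank=23
"""
    for rank, hits in tally_dest_ranks(sample).most_common():
        print(f"dest rank {rank}: {hits} failed send(s)")
```

In the full log quoted below, destination ranks 21 and 23 each appear three times and rank 20 once, so the nodes hosting those ranks would be reasonable places to look for a core dump first. This is only a heuristic; it does not show which process failed first in time.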

Thanks

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Fri, 1 Dec 2006, Vishwas wrote:

> Hello,
>
>
>
> I had reported that my job got killed after a few days of running, giving the
> following error.
>
>
>
> [138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=21
>
>  at line 410 in file vapi_channel_manager.c
>
> send desc error
>
> [131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=23
>
>  at line 410 in file vapi_channel_manager.c
>
> rank 138 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
> all ranks
>
>   exit status of rank 138: killed by signal 9
>
> rank 131 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
> all ranks
>
>   exit status of rank 131: killed by signal 9
>
> rank 86 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
> all ranks
>
>   exit status of rank 86: killed by signal 9
>
> ~/ROBUST/nov15_2006_3x7
>
> ~/ROBUST/nov15_2006_3x7
>
> send desc error
>
>
>
> I found that someone else had reported a similar error, and the solution was as
> follows.
>
>
>
> Try to increase the VAPI driver timeout parameter, VIADEV_DEFAULT_TIME_OUT,
> for the MPI stack. To achieve this, use the '-paramfile filename' option
> with mpirun_rsh. For example, you can run:
>
>     /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -np 2 -paramfile ./perfparams -hostfile /root/cluster /usr/local/ibgd/mpi/osu/gcc/tests/PMB2.2.1/PMB-MPI1
>
> where the file perfparams includes the following line:
>
>     VIADEV_DEFAULT_TIME_OUT = 12
>
>
>
> I want to know whether the same applies to my problem. Please help me out, since
> if the same happens again, I would lose many days.
>
>
>
> Regards
>
> Vishwas
>
>    _____
>
> From: mvapich-discuss-bounces at cse.ohio-state.edu
> [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
> Sent: Thursday, November 30, 2006 5:53 PM
> To: 'Axel Rimanek'
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: RE: [mvapich-discuss] job aborted after a few days run
>
>
>
> Hello Axel,
>
>
>
> No, I have not used nonblocking communication, only MPI_Send and MPI_Recv,
> i.e., blocking.
>
>
>
> Regards
>
> Vishwas
>
>
>
>    _____
>
> From: Axel Rimanek [mailto:Axel at Rimanek.de]
> Sent: Thursday, November 30, 2006 3:28 PM
> To: 'Vishwas'
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: AW: [mvapich-discuss] job aborted after a few days run
>
>
>
> Hello Vishwas,
>
> did you also use nonblocking communications?
>
>
>
> Axel
>
>
>
>    _____
>
> From: mvapich-discuss-bounces at cse.ohio-state.edu
> [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Vishwas
> Sent: Thursday, November 30, 2006 6:46 AM
> To: mvapich-discuss at cse.ohio-state.edu
> Subject: [mvapich-discuss] job aborted after a few days run
>
>
>
> Hello,
>
>
>
> I was running a farming job on my cluster. After a few days of running, the job
> was aborted abruptly. The following error was generated in the log file.
>
>
>
> [138] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=21
>
>  at line 410 in file vapi_channel_manager.c
>
> send desc error
>
> [131] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=23
>
>  at line 410 in file vapi_channel_manager.c
>
> rank 138 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
> all ranks
>
>   exit status of rank 138: killed by signal 9
>
> rank 131 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
> all ranks
>
>   exit status of rank 131: killed by signal 9
>
> rank 86 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
> all ranks
>
>   exit status of rank 86: killed by signal 9
>
> ~/ROBUST/nov15_2006_3x7
>
> ~/ROBUST/nov15_2006_3x7
>
> send desc error
>
> [76] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=20
>
>  at line 410 in file vapi_channel_manager.c
>
> send desc error
>
> [61] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=21
>
>  at line 410 in file vapi_channel_manager.c
>
> rank 76 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
> all ranks
>
>   exit status of rank 76: killed by signal 9
>
> rank 61 in job 9  gulabjamun.ncbs.res.in_34137   caused collective abort of
> all ranks
>
>   exit status of rank 61: killed by signal 9
>
> ~/ROBUST/nov15_2006_3x7
>
> send desc error
>
> [52] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=23
>
>  at line 410 in file vapi_channel_manager.c
>
> send desc error
>
> [39] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=23
>
>  at line 410 in file vapi_channel_manager.c
>
> send desc error
>
> [27] Abort: [] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor
> code=81, dest rank=21
>
>  at line 410 in file vapi_channel_manager.c
>
>
>
> Regards
>
> Vishwas
>
>


