[mvapich-discuss] error IBV_WC_RETRY_EXC_ERR, code=12

Matthew Koop koop at cse.ohio-state.edu
Wed Mar 11 11:55:26 EDT 2009


Hi Pawel,

Is this a cluster that has been set up recently? If so,
IBV_WC_RETRY_EXC_ERR can show up during an application run if there is a
loose cable, a bad HCA, or a bad switch blade. The status means the
sending HCA exhausted its transport retry counter while waiting for ACKs
from the peer, which almost always points at the fabric rather than the
application.
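
To see where the message in your log comes from: MVAPICH polls the verbs
completion queue and aborts when a completion carries an error status. A
minimal sketch in plain libibverbs (illustrative only, not MVAPICH's
actual code; 'cq' is assumed to be a completion queue created elsewhere
with ibv_create_cq):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Illustrative sketch only -- not MVAPICH's actual code. */
    static void drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        /* Poll one work completion at a time until the CQ is empty. */
        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS) {
                /* IBV_WC_RETRY_EXC_ERR is value 12 in the ibv_wc_status
                 * enum: the transport retry counter was exceeded, i.e.
                 * the peer never ACKed -- hence "code=12" in the abort
                 * message above. */
                fprintf(stderr,
                        "Got completion with error %s, code=%d\n",
                        ibv_wc_status_str(wc.status), (int) wc.status);
            }
        }
    }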

Can you try running mpiGraph and see if it shows any problems? You can
download it from:

http://sourceforge.net/projects/mpigraph

Just run one process per node and it will generate a picture of the
"health" of the network. Dark lines in that picture mark low-bandwidth
links, which almost always indicates a hardware problem.
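
For example, with MVAPICH's mpirun_rsh launcher (the mpiGraph arguments
below -- message size, iteration count, and window depth -- are guesses
at reasonable values; check the README that ships with it):

    mpirun_rsh -np <num_nodes> -hostfile ./hosts ./mpiGraph 4096 10 10 > mpigraph.out
    crunch_mpiGraph mpigraph.out

If I remember right, the bundled crunch_mpiGraph script turns the raw
output into an HTML page with the send/receive bandwidth bitmaps, which
is where the dark lines show up.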

Matt

On Wed, 11 Mar 2009, Pawel Dziekonski wrote:

> Hello,
>
> I am trying to run Linpack on my whole cluster, and it fails with:
>
> Column=091896 Fraction=0.135 Mflops=15485000.90
> Column=095256 Fraction=0.140 Mflops=15490137.96
> Abort signaled by rank 1036: [wn206:1036] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=91
>
> Exit code -3 signaled from wn206
> Killing remote processes...Abort signaled by rank 1946: [wn320:1946] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=56
>
> Abort signaled by rank 911: [wn190:911] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1415
>
> Abort signaled by rank 927: [wn192:927] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=297
>
> Abort signaled by rank 660: [wn159:660] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1605
>
> Abort signaled by rank 1188: [wn225:1188] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1503
>
> Abort signaled by rank 665: [wn160:665] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1610
>
> Abort signaled by rank 1046: [wn207:1046] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=101
>
> MPI process terminated unexpectedly
> Signal 15 received.
> Signal 15 received.
> connect: Connection timed out
> Signal 15 received.
> connect: Connection timed out
> connect: Connection timed out
> connect: Connection timed out
> [...]
>
> The wnXXX hosts are worker nodes in the cluster. Which of the ones
> mentioned above could be the problem? All of them seem to work fine at
> first glance.
>
> Linpack micro-benchmarks on all pairs of nodes work fine too.
>
> I use MVAPICH 1.1 and HPL from Intel MKL on EM64T, with Mellanox HCAs.
>
> thanks in advance, Pawel
>
>
>
> --
> Pawel Dziekonski <pawel.dziekonski at wcss.pl>
> Wroclaw Centre for Networking & Supercomputing, HPC Department
> Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
> phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl