[mvapich-discuss] error IBV_WC_RETRY_EXC_ERR, code=12

Pawel Dziekonski dzieko at wcss.pl
Wed Mar 11 03:26:16 EDT 2009


Hello,

I am trying to run Linpack across my whole cluster, and it fails with:

Column=091896 Fraction=0.135 Mflops=15485000.90
Column=095256 Fraction=0.140 Mflops=15490137.96
Abort signaled by rank 1036: [wn206:1036] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=91

Exit code -3 signaled from wn206
Killing remote processes...Abort signaled by rank 1946: [wn320:1946] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=56

Abort signaled by rank 911: [wn190:911] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1415

Abort signaled by rank 927: [wn192:927] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=297

Abort signaled by rank 660: [wn159:660] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1605

Abort signaled by rank 1188: [wn225:1188] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1503

Abort signaled by rank 665: [wn160:665] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1610

Abort signaled by rank 1046: [wn207:1046] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=101

MPI process terminated unexpectedly
Signal 15 received.
Signal 15 received.
connect: Connection timed out
Signal 15 received.
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
[...]

wnXXX are worker nodes in the cluster. Which of the nodes mentioned above
could be the problem? All of them seem to work fine at first glance.
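
To narrow this down, the abort lines can be tallied per reporting node and
per destination rank. Below is a minimal sketch in Python, assuming the
message format is exactly the one shown in the output above. The name
tally_aborts is hypothetical and not part of MVAPICH. A node or destination
rank that shows up repeatedly would be a natural first suspect, although the
reporting node is only the side that ran out of retries, so the fault may
also be at the destination or on the link in between.

```python
import re
import sys
from collections import Counter

# Pattern matching the abort lines as they appear in the output above.
# Assumption: every abort line follows this exact format.
ABORT_RE = re.compile(
    r"Abort signaled by rank (\d+): \[(\w+):\d+\] "
    r"Got completion with error (\w+), code=(\d+), dest rank=(\d+)"
)


def tally_aborts(log_text):
    """Count abort messages per reporting host and per destination rank.

    Hypothetical helper: returns two Counters, (hosts, dest_ranks).
    """
    hosts = Counter()
    dest_ranks = Counter()
    # finditer also catches abort messages fused onto the end of another
    # line, such as "Killing remote processes...Abort signaled by ...".
    for match in ABORT_RE.finditer(log_text):
        _rank, host, _error, _code, dest = match.groups()
        hosts[host] += 1
        dest_ranks[int(dest)] += 1
    return hosts, dest_ranks


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally_aborts.py <saved job output file>
    with open(sys.argv[1]) as log_file:
        hosts, dest_ranks = tally_aborts(log_file.read())
    for host, count in hosts.most_common():
        print(f"reporting host {host}: {count}")
    for dest, count in dest_ranks.most_common():
        print(f"dest rank {dest}: {count}")
```

Mapping a destination rank back to a node name would still need the
machine file used for the job, which is not shown here.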

Micro-benchmarks with Linpack on pairs of nodes, covering all nodes, work fine too.

I use MVAPICH 1.1 and HPL from Intel MKL on em64t with Mellanox HCAs.

Thanks in advance, Pawel



-- 
Pawel Dziekonski <pawel.dziekonski at wcss.pl>
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl

