[mvapich-discuss] error IBV_WC_RETRY_EXC_ERR, code=12
Pawel Dziekonski
dzieko at wcss.pl
Wed Mar 11 03:26:16 EDT 2009
Hello,
I am trying to run Linpack across my whole cluster, and it fails with:
Column=091896 Fraction=0.135 Mflops=15485000.90
Column=095256 Fraction=0.140 Mflops=15490137.96
Abort signaled by rank 1036: [wn206:1036] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=91
Exit code -3 signaled from wn206
Killing remote processes...Abort signaled by rank 1946: [wn320:1946] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=56
Abort signaled by rank 911: [wn190:911] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1415
Abort signaled by rank 927: [wn192:927] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=297
Abort signaled by rank 660: [wn159:660] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1605
Abort signaled by rank 1188: [wn225:1188] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1503
Abort signaled by rank 665: [wn160:665] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1610
Abort signaled by rank 1046: [wn207:1046] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=101
MPI process terminated unexpectedly
Signal 15 received.
Signal 15 received.
connect: Connection timed out
Signal 15 received.
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
[...]
The wnXXX hosts are worker nodes in the cluster. Which of the nodes mentioned
above could be the problem? All of them seem to work fine at first glance.
Micro-benchmarks with Linpack on all pairs of nodes work fine too.
I use MVAPICH 1.1 and HPL from Intel MKL on em64t with Mellanox HCAs.
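One way to narrow down the suspects is to tally which hosts report the error and which destination ranks they were sending to. IBV_WC_RETRY_EXC_ERR means the sender exhausted its transport retries without an acknowledgement from the remote side, so a destination rank (or the link to it) that shows up repeatedly is worth checking first. The following is only a minimal sketch: the regular expression is fitted to the abort lines quoted above, the function name `tally_aborts` and the `SAMPLE` text are illustrative, and mapping a destination rank back to a host name would still need the machinefile used for the run, which is not shown here.

```python
import re
from collections import Counter

# Assumption: abort lines look exactly like the ones quoted in this post.
ABORT_RE = re.compile(
    r"Abort signaled by rank (\d+): \[(\w+):\d+\] "
    r"Got completion with error (\w+), code=(\d+), dest rank=(\d+)"
)


def tally_aborts(log_text):
    """Count reporting hosts and destination ranks over all abort messages."""
    hosts = Counter()
    dest_ranks = Counter()
    # finditer (rather than line-by-line matching) also catches messages
    # that were fused onto one line, e.g. after "Killing remote processes...".
    for match in ABORT_RE.finditer(log_text):
        _src_rank, host, _error, _code, dest = match.groups()
        hosts[host] += 1
        dest_ranks[int(dest)] += 1
    return hosts, dest_ranks


# Illustrative sample: three of the abort messages from the log above.
SAMPLE = (
    "Abort signaled by rank 1036: [wn206:1036] Got completion with error "
    "IBV_WC_RETRY_EXC_ERR, code=12, dest rank=91\n"
    "Exit code -3 signaled from wn206\n"
    "Killing remote processes...Abort signaled by rank 1946: [wn320:1946] "
    "Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=56\n"
    "Abort signaled by rank 911: [wn190:911] Got completion with error "
    "IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1415\n"
)

if __name__ == "__main__":
    hosts, dest_ranks = tally_aborts(SAMPLE)
    print("reporting hosts:", hosts.most_common())
    print("destination ranks:", dest_ranks.most_common())
```

Running the same function over the full job output from several failed runs would show whether the same destination ranks keep recurring or whether the failures are spread across the fabric.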
Thanks in advance, Pawel
--
Pawel Dziekonski <pawel.dziekonski at wcss.pl>
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl