[mvapich-discuss] On "Got Completion" and IBV_EVENT Errors
Joshua Bernstein
jbernstein at penguincomputing.com
Mon Jan 28 16:27:25 EST 2008
Hi All,
I've seen various posts about this error, including one from this month
that seems related, but I never see any resolution:
http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-January/001340.html
When I run a very simple (cpi for example) MVAPICH job using the ch_gen2
transport, the job starts up, but just seems to hang. After a bit of
time I am left with this:
[1:n2] Abort: [n2:1] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12
[0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
This smells like a timeout! After reading through some of the archives
I came across the environment variable VIADEV_USE_SHMEM_COLL. Setting
this variable to:
VIADEV_USE_SHMEM_COLL=0
seems to allow the job to get a little further, because I now get
stdout output from the processes before the hang:
...
Hello from Process 0 on n2
Hello from Process 1 on n2
...
Once again I reach a hang, this time right where the sample program
tries to do some MPI communication. The output is as follows:
[0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
at line 2554 in file viacheck.c
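As a cross-check on the abort lines above, here is a minimal sketch that pulls the rank, host, error name, and numeric code out of each line and compares the code against the libibverbs enums. The helper name parse_abort and the two lookup tables are hypothetical, written only for illustration; the numeric values are a small subset assumed to match verbs.h in the installed OFED release, so verify them against your own headers. IBV_WC_RETRY_EXC_ERR is the "transport retry counter exceeded" status, which fits the timeout reading.

```python
import re

# Subset of enum values assumed from libibverbs' verbs.h; check them
# against the headers shipped with your OFED release.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry counter exceeded
    13: "IBV_WC_RNR_RETRY_EXC_ERR",  # receiver-not-ready retries exceeded
}
IBV_EVENT_TYPE = {
    1: "IBV_EVENT_QP_FATAL",
    16: "IBV_EVENT_QP_LAST_WQE_REACHED",
}

# Matches the two abort formats quoted in this post.
ABORT_RE = re.compile(
    r"^\[(?P<rank>\d+):(?P<host>[^\]]+)\] Abort: \[[^\]]*\] Got "
    r"(?P<kind>completion with error|FATAL event) "
    r"(?P<name>IBV_\w+), code=(?P<code>\d+)"
)


def parse_abort(line):
    """Return a dict describing one abort line, or None if it does not match."""
    m = ABORT_RE.match(line.strip())
    if m is None:
        return None
    code = int(m.group("code"))
    is_completion = m.group("kind").startswith("completion")
    table = IBV_WC_STATUS if is_completion else IBV_EVENT_TYPE
    return {
        "rank": int(m.group("rank")),
        "host": m.group("host"),
        "kind": "completion" if is_completion else "event",
        "name": m.group("name"),
        "code": code,
        # True when the printed name agrees with the assumed enum value.
        "consistent": table.get(code) == m.group("name"),
    }


if __name__ == "__main__":
    for line in (
        "[1:n2] Abort: [n2:1] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12",
        "[0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16",
    ):
        print(parse_abort(line))
```

Running this over the collected output of a larger job makes it easier to see which rank reported the completion error first and which ranks only reported the follow-on event.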
Again, I've read through the archives and have determined that
everything seems to check out. ibchecknet and other ibv_ and ib_
commands come up clean. Also the osu_* sample tests exhibit the exact
same behavior.
I'm totally left in the dark now, so any help would be greatly appreciated.
Running: RHEL4u6, OFED1.2, and MVAPICH 0.9.9
-Joshua Bernstein
Software Engineer
Penguin Computing