[mvapich-discuss] On "Got Completion" and IBV_EVENT Errors

Joshua Bernstein jbernstein at penguincomputing.com
Mon Jan 28 16:27:25 EST 2008


Hi All,

I've seen various posts about this error, including one from this month 
that seems related, but I have not found a resolution in any of them.

http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-January/001340.html

When I run a very simple (cpi for example) MVAPICH job using the ch_gen2 
transport, the job starts up, but just seems to hang. After a bit of 
time I am left with this:

[1:n2] Abort: [n2:1] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12
[0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
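
For anyone comparing output like this across ranks or runs, here is a 
minimal sketch of pulling the rank, host, event name and code out of abort 
lines in the format shown above. The names parse_abort and Abort are 
hypothetical helpers written for this illustration and are not part of 
MVAPICH or OFED. The pattern only assumes the two line shapes quoted in 
this message.

```python
import re
from typing import NamedTuple, Optional

# Matches the two abort shapes quoted in this message, e.g.
#   [1:n2] Abort: [n2:1] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12
#   [0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
ABORT_RE = re.compile(
    r"^\[(?P<rank>\d+):(?P<host>[^\]]+)\] Abort: \[[^\]]*\] "
    r"Got (?P<kind>completion with error|FATAL event) "
    r"(?P<name>IBV_\w+), code=(?P<code>\d+)"
)


class Abort(NamedTuple):
    """Hypothetical record for one parsed abort line."""
    rank: int
    host: str
    kind: str
    name: str
    code: int


def parse_abort(line: str) -> Optional[Abort]:
    """Return the parsed fields, or None if the line is not an abort line."""
    m = ABORT_RE.match(line.strip())
    if m is None:
        return None
    return Abort(int(m["rank"]), m["host"], m["kind"], m["name"], int(m["code"]))


if __name__ == "__main__":
    sample = [
        "[1:n2] Abort: [n2:1] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12",
        "[0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16",
    ]
    for line in sample:
        print(parse_abort(line))
```

Lines that do not match, such as the "Hello from Process" output, come back 
as None, so the helper can be run over a whole job log.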

This smells like a timeout! After reading through some of the archives, 
I came across the environment variable VIADEV_USE_SHMEM_COLL. Setting 
it to:

VIADEV_USE_SHMEM_COLL=0

seems to allow the job to get a little further, because now I get output 
from each process before the hang:

...
Hello from Process 0 on n2
Hello from Process 1 on n2
...

Once again the job hangs, this time right where the sample program 
tries to do some MPI communication. The output is as follows:

[0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
  at line 2554 in file viacheck.c

Again, I've read through the archives and, as far as I can tell, 
everything checks out: ibchecknet and the other ibv_ and ib_ commands 
come up clean. The osu_* sample tests exhibit the exact same behavior.

I'm totally left in the dark now, so any help would be greatly appreciated.

Running: RHEL 4 Update 6, OFED 1.2, and MVAPICH 0.9.9

-Joshua Bernstein
Software Engineer
Penguin Computing




