[mvapich-discuss] On "Got Completion" and IBV_EVENT Errors

Matthew Koop koop at cse.ohio-state.edu
Tue Jan 29 11:05:07 EST 2008


Joshua,

So are you able to run `ibv_rc_pingpong' with a variety of message sizes?
You may also want to double-check that the cables between machines are
firmly connected.
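
A size sweep for that check could be generated along these lines. This is
only a sketch: the helper names sizes and client_cmd and the host name n1
are made up for illustration, and -s is the message-size option of
ibv_rc_pingpong. I am assuming the server side is restarted with the same
-s value before each client run.

```shell
#!/bin/sh
# Sketch of a message-size sweep for ibv_rc_pingpong (client side).
# "sizes" and "client_cmd" are illustrative helper names, not part of OFED.

# Print powers of two from $1 up to and including $2, one per line.
sizes() {
    s=$1
    while [ "$s" -le "$2" ]; do
        echo "$s"
        s=$((s * 2))
    done
}

# Print the client command line for message size $1 against server host $2.
# On the server, "ibv_rc_pingpong -s <same size>" should already be waiting.
client_cmd() {
    echo "ibv_rc_pingpong -s $1 $2"
}
```

Running `for s in $(sizes 64 1048576); do client_cmd "$s" n1; done` prints
one client command per size, which you can then run by hand against the
server.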

With the earlier request you cited, the issue didn't occur for simple
microbenchmarks, only with an application. We have previously seen issues
when fork() or system() is called in applications (due to
incompatibilities with the underlying OpenFabrics drivers).

Your problem is more likely to be a setup issue. What does
`ulimit -l' report on your compute nodes? Also, it is unlikely that
VIADEV_USE_SHMEM_COLL is causing any issue -- turning off this option
means there is less communication in the init phase (which allows you to
get to the stdout statements).
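
For the locked-memory check, a small helper like the one below could
classify what `ulimit -l' reports. It is a sketch under assumptions:
check_memlock is a made-up name, and the 65536 kB default threshold is an
arbitrary illustration, not an MVAPICH or OFED requirement. On InfiniBand
nodes the limit is commonly set to unlimited.

```shell
#!/bin/sh
# Classify a "ulimit -l" value: kilobytes, or the word "unlimited".
# $1 = value to classify, $2 = optional threshold in kB (hypothetical default).
check_memlock() {
    case "$1" in
        unlimited)
            echo "ok: unlimited" ;;
        ''|*[!0-9]*)
            # Empty or non-numeric input: report it rather than guess.
            echo "unknown: $1" ;;
        *)
            if [ "$1" -lt "${2:-65536}" ]; then
                echo "low: $1 kB"
            else
                echo "ok: $1 kB"
            fi ;;
    esac
}

# Report the limit for the current shell.
check_memlock "$(ulimit -l 2>/dev/null)"
```

The limit seen in an interactive login can differ from the one the MPI
processes inherit, so it is worth running this the same way the job is
launched (for example through the same rsh/ssh path).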

Thanks,

Matt

On Mon, 28 Jan 2008, Joshua Bernstein wrote:

> Hi All,
>
> 	I've seen various posts about this error including something that seems
> related from this month, though I never see any resolution.
>
> http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-January/001340.html
>
> When I run a very simple (cpi for example) MVAPICH job using the ch_gen2
> transport, the job starts up, but just seems to hang. After a bit of
> time I am left with this:
>
> [1:n2] Abort: [n2:1] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12
> [0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
>
> This smells like a timeout! So, after reading through some of the
> archives I came across the envar VIADEV_USE_SHMEM_COLL, so setting this
> variable to:
>
> VIADEV_USE_SHMEM_COLL=0
>
> seems to allow the job to get a little further. Because now I get STDIO
> from the process before the hang:
>
> ...
> Hello from Process 0 on n2
> Hello from Process 1 on n2
> ...
>
> Once again I reach a hang, though this is right where the sample program
> tries to do some MPI communication. The output is as follows:
>
> [0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
>   at line 2554 in file viacheck.c
>
> Again, I've read through the archives and have determined that
> everything seems to check out. ibchecknet and other ibv_ and ib_
> commands come up clean. Also the osu_* sample tests exhibit the exact
> same behavior.
>
> I'm totally left in the dark now, so any help would be greatly appreciated.
>
> Running: RHEL4u6, OFED1.2, and MVAPICH 0.9.9
>
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
