[mvapich-discuss] FATAL event IBV_EVENT_QP_LAST_WQE_REACHED

Matthew Koop koop at cse.ohio-state.edu
Mon Aug 27 14:33:24 EDT 2007


Nathan,

Do you ever see any other errors, or is it just
IBV_EVENT_QP_LAST_WQE_REACHED? Sometimes when a job fails, one process
hits the real error and the remaining processes then exit with a different
error status that is unrelated to the underlying problem.

Can you try running with VIADEV_USE_SHMEM_COLL=0 and see if that makes any
difference?

e.g.
mpirun_rsh -np 4 h1 h1 h2 h2 VIADEV_USE_SHMEM_COLL=0 ./IMB-MPI1

If that doesn't make a difference, you can also try the following, passed
on the command line in the same way:
VIADEV_USE_COALESCE=0

These will help us narrow down the problem a bit.
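
A minimal sketch of trying the two settings one at a time, assuming the
same hosts (h1, h2) and benchmark (./IMB-MPI1) as in the example above.
The run_case helper and the DRY_RUN variable are hypothetical names used
only for illustration; they are not part of MVAPICH. With DRY_RUN=1 (the
default here) the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: build one mpirun_rsh command line per setting to test.
# Hosts and benchmark come from the example above; run_case and DRY_RUN
# are hypothetical helpers for illustration only.

HOSTS="h1 h1 h2 h2"
NP=4
BENCH=./IMB-MPI1
DRY_RUN=${DRY_RUN:-1}

run_case() {
    # $1 is a single NAME=VALUE setting placed before the executable
    cmd="mpirun_rsh -np $NP $HOSTS $1 $BENCH"
    if [ "$DRY_RUN" = "1" ]; then
        # Only show the command line that would be run
        echo "$cmd"
    else
        echo "== $1 =="
        $cmd
        echo "exit status: $?"
    fi
}

# Try each setting separately so a change in behavior can be
# attributed to one setting at a time.
for setting in VIADEV_USE_SHMEM_COLL=0 VIADEV_USE_COALESCE=0; do
    run_case "$setting"
done
```

Running it with DRY_RUN=0 would actually launch the jobs; comparing the
output of the two runs should show whether either setting changes the
failure.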

Thanks,
Matt


On Tue, 21 Aug 2007, Nathan Dauchy wrote:

> Updated...
>
> Nathan Dauchy wrote:
> > I finally had time to get back to this issue...
> >
> > The OSU benchmarks run fine.
> > The presta benchmarks run fine.
> > I'm getting a segfault with IMB.  Running it under gdb, I grabbed the
> > following backtrace:
> >
> > (gdb) bt
> > #0  0x000000000044a846 in movdqa8 ()
> > #1  0x00000000004490b6 in _intel_fast_memcpy.J ()
> > #2  0x000000000043c03f in smpi_recv_get ()
> > #3  0x000000000043a03b in smpi_net_lookup ()
> > #4  0x0000000000439d61 in MPID_SMP_Check_incoming ()
> > #5  0x000000000042aff8 in viutil_spinandwaitcq ()
> > #6  0x000000000042a426 in MPID_DeviceCheck ()
> > #7  0x000000000043334f in MPID_RecvComplete ()
> > #8  0x00000000004369c5 in MPID_RecvDatatype ()
> > #9  0x000000000040eefe in PMPI_Recv ()
> > #10 0x0000000000407880 in IMB_pingpong (c_info=0x60d790, size=-1765560296,
> >     n_sample=-1765558112, RUN_MODE=0x60e000, time=0x1770)
> >     at IMB_pingpong.c:180
> > #11 0x000000000040636e in IMB_warm_up (c_info=0x60d790, Bmark=0x2a96c3b018,
> >     iter=-1765558112) at IMB_warm_up.c:127
> > #12 0x000000000040393f in main (argc=1, argv=0x7fbfffe508) at IMB.c:262
> >
> > It doesn't look to me like it is actually related to the
> > IBV_EVENT_QP_LAST_WQE_REACHED error, but I'm sure others on this list
> > can tell better than I can.  Does the IMB segfault point to anything in
> > particular?
>
> IMB now runs fine.  I had the library path wrong when compiling.
>
> >> Sayantan Sur wrote:
> >>> Thanks for reporting the problem. The event
> >>> IBV_EVENT_QP_LAST_WQE_REACHED means that the QP (internal InfiniBand
> >>> communication channel) is in an error state and all requests are
> >>> consumed. Could it be related to a setup issue? Can you run any other
> >>> MPI programs such as OSU benchmarks, IMB etc. on all these nodes?
> >>>
>
> Any other ideas?
>
> Thanks,
> Nathan
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


