[mvapich-discuss] FATAL event IBV_EVENT_QP_LAST_WQE_REACHED

Nathan Dauchy Nathan.Dauchy at noaa.gov
Tue Aug 21 15:21:54 EDT 2007


I finally had time to get back to this issue...

The OSU benchmarks run fine.
The presta benchmarks run fine.
I'm getting a segfault with IMB.  Running it under gdb, I grabbed the
following backtrace:

(gdb) bt
#0  0x000000000044a846 in movdqa8 ()
#1  0x00000000004490b6 in _intel_fast_memcpy.J ()
#2  0x000000000043c03f in smpi_recv_get ()
#3  0x000000000043a03b in smpi_net_lookup ()
#4  0x0000000000439d61 in MPID_SMP_Check_incoming ()
#5  0x000000000042aff8 in viutil_spinandwaitcq ()
#6  0x000000000042a426 in MPID_DeviceCheck ()
#7  0x000000000043334f in MPID_RecvComplete ()
#8  0x00000000004369c5 in MPID_RecvDatatype ()
#9  0x000000000040eefe in PMPI_Recv ()
#10 0x0000000000407880 in IMB_pingpong (c_info=0x60d790, size=-1765560296,
    n_sample=-1765558112, RUN_MODE=0x60e000, time=0x1770)
    at IMB_pingpong.c:180
#11 0x000000000040636e in IMB_warm_up (c_info=0x60d790, Bmark=0x2a96c3b018,
    iter=-1765558112) at IMB_warm_up.c:127
#12 0x000000000040393f in main (argc=1, argv=0x7fbfffe508) at IMB.c:262
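
For reference, frames #10 and #11 are just IMB's warm-up ping-pong, so
the failing path should reduce to a plain MPI_Send/MPI_Recv exchange
between two ranks on the same node (the smpi_* frames are the
shared-memory path).  Here is a stripped-down reproducer along those
lines that I can try next; the message size, tag, and iteration count
are arbitrary choices on my part, not what IMB actually uses:

/* Minimal two-rank ping-pong, loosely mirroring IMB's warm-up loop.
 * Run with 2 ranks on one node to exercise the shared-memory path, e.g.:
 *   mpirun_rsh -np 2 -hostfile <file listing one host twice> ./pingpong
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int msg_len = 4 * 1024 * 1024;  /* arbitrary; big enough to hit the memcpy path */
    int iters   = 100;              /* arbitrary iteration count */
    char *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "need at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = malloc(msg_len);
    memset(buf, rank, msg_len);

    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_len, MPI_CHAR, 1, 42, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_len, MPI_CHAR, 1, 42, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_len, MPI_CHAR, 0, 42, MPI_COMM_WORLD, &status);
            MPI_Send(buf, msg_len, MPI_CHAR, 0, 42, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) printf("ping-pong completed\n");
    free(buf);
    MPI_Finalize();
    return 0;
}

If that survives on the nodes where IMB crashes, I'll try bumping the
message size toward what IMB's warm-up uses.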

It doesn't look to me like it is actually related to the
IBV_EVENT_QP_LAST_WQE_REACHED error, but I'm sure others on this list
can tell better than I can.  Does the IMB segfault point to anything in
particular?
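
Also, to see whether anything at the device or port level is acting up
around the time of the abort, I put together a small standalone watcher
for the HCA's asynchronous event queue using the libibverbs API.  One
caveat: QP-affiliated events such as IBV_EVENT_QP_LAST_WQE_REACHED are
only delivered to the context that owns the QP, so a separate program
like this cannot see MVAPICH's own QP events; it will only show port-
and device-level events (link down, fatal device errors, and so on).
It is just a rough sketch, opening only the first HCA with minimal
error handling:

/* Print asynchronous events reported by the first HCA.
 * Build with:  gcc -o ib_event_watch ib_event_watch.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list;
    struct ibv_context *ctx;
    struct ibv_async_event event;

    dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no InfiniBand devices found\n");
        return 1;
    }

    ctx = ibv_open_device(dev_list[0]);  /* first HCA only, for simplicity */
    if (!ctx) {
        fprintf(stderr, "failed to open %s\n",
                ibv_get_device_name(dev_list[0]));
        return 1;
    }

    printf("watching async events on %s\n",
           ibv_get_device_name(dev_list[0]));
    for (;;) {
        if (ibv_get_async_event(ctx, &event))  /* blocks until an event arrives */
            break;
        if (event.event_type == IBV_EVENT_QP_LAST_WQE_REACHED)
            printf("IBV_EVENT_QP_LAST_WQE_REACHED (event type %d)\n",
                   (int) event.event_type);
        else
            printf("async event type %d\n", (int) event.event_type);
        ibv_ack_async_event(&event);  /* every retrieved event must be acked */
    }

    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}

Running that on the affected nodes while the job is up should at least
tell us whether the QP error coincides with a port or device event.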

Thanks,
Nathan


Nathan Dauchy wrote:
> Sayantan,
> 
> Thanks for the quick reply!
> 
> We were previously running openIB gen2 (r-7.6.07 I think) and MVAPICH
> 0.9.8.  In that environment, several benchmarks were run, as well as
> many user codes, and I'm not aware of any problems related to
> QP_LAST_WQE_REACHED.
> 
> Since upgrading, we have not yet run many benchmarks.  I'm out of the
> office for the next week but will see about running the ones you suggest
> when I get back.
> 
> Thanks,
> Nathan
> 
> 
> Sayantan Sur wrote:
>> Hi Nathan,
>>
>> Nathan Dauchy wrote:
>>> Pierrick, All,
>>>
>>> We recently upgraded to OFED-1.2 (mvapich-0.9.9) and are now getting an
>>> error that looks similar to yours:
>>>
>>> [0:w72] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
>>> code=16 at line 2552 in file viacheck.c
>>>
>>> Did you ever find a solution?  (I can't find one in the archives.)
>>>
>>> Can someone explain what the IBV_EVENT_QP_LAST_WQE_REACHED error means?
>>>   I can't find any clues in the source, and have been unable to turn up
>>> any relevant docs either.
>>>
>>> Thanks for any help and clues you can offer!
>>>   
>>
>> Thanks for reporting the problem. The event
>> IBV_EVENT_QP_LAST_WQE_REACHED means that the QP (internal InfiniBand
>> communication channel) is in an error state and all requests are
>> consumed. Could it be related to a setup issue? Can you run any other
>> MPI programs, such as the OSU benchmarks, IMB, etc., on all these nodes?
>>
>> Thanks,
>> Sayantan.
>>
>>> Regards,
>>> Nathan
>>>
>>>
>>> Pierrick Penven, Tue Mar 27 12:33:03 EDT 2007:
>>>  
>>>> Dear all,
>>>>
>>>> I am trying to install an ocean model on a cluster based on 64-bit
>>>> dual-socket, dual-core AMD Opteron nodes with InfiniBand, using
>>>> pathf90 and mvapich v0.9.8.
>>>>
>>>> The model runs and scales well on 1 node, but is not able to run
>>>> on several nodes.  I have tried using mvapich v0.9.9.beta, and I
>>>> get the following message:
>>>>
>>>> [1:chpcc060] Abort: [1:chpcc060] Abort: [chpcc060:1] Got completion
>>>> with error IBV_WC_LOC_PROT_ERR, code=4 at line 2374 in file viacheck.c
>>>> [1] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line
>>>> 2554 in file viacheck.c
>>>> [0:chpcc058] Abort: [0] Got FATAL event
>>>> IBV_EVENT_QP_LAST_WQE_REACHED, code=16
>>>>  at line 2554 in file viacheck.c
>>>> /CHPC/usr/local/mvapich_099b/bin/mpirun: line 1: 17412
>>>> Terminated              /CHPC/usr/local/mvapich_099b/bin/mpirun_rsh
>>>> -np 2 -hostfile /CHPC/home/loadl/execute/chpcln.4269.0.machinefile
>>>> /CHPC/home/ppenven/Roms_tools/TEST1/./roms
>>>>
>>>> The problem does not occur when using MPI over IP rather than VAPI.
>>>>
>>>> Is there a solution to this problem?
>>>>
>>>> Thanks a lot
>>>>
>>>> Pierrick
>>>>     
>>>   
>>
> 



