[mvapich-discuss] FATAL event IBV_EVENT_QP_LAST_WQE_REACHED
Nathan Dauchy
Nathan.Dauchy at noaa.gov
Tue Aug 21 15:21:54 EDT 2007
I finally had time to get back to this issue...
The OSU benchmarks run fine.
The presta benchmarks run fine.
I'm getting a segfault with IMB. Running it under gdb, I grabbed the
following backtrace:
(gdb) bt
#0 0x000000000044a846 in movdqa8 ()
#1 0x00000000004490b6 in _intel_fast_memcpy.J ()
#2 0x000000000043c03f in smpi_recv_get ()
#3 0x000000000043a03b in smpi_net_lookup ()
#4 0x0000000000439d61 in MPID_SMP_Check_incoming ()
#5 0x000000000042aff8 in viutil_spinandwaitcq ()
#6 0x000000000042a426 in MPID_DeviceCheck ()
#7 0x000000000043334f in MPID_RecvComplete ()
#8 0x00000000004369c5 in MPID_RecvDatatype ()
#9 0x000000000040eefe in PMPI_Recv ()
#10 0x0000000000407880 in IMB_pingpong (c_info=0x60d790, size=-1765560296,
n_sample=-1765558112, RUN_MODE=0x60e000, time=0x1770)
at IMB_pingpong.c:180
#11 0x000000000040636e in IMB_warm_up (c_info=0x60d790, Bmark=0x2a96c3b018,
iter=-1765558112) at IMB_warm_up.c:127
#12 0x000000000040393f in main (argc=1, argv=0x7fbfffe508) at IMB.c:262
It doesn't look to me like it is actually related to the
IBV_EVENT_QP_LAST_WQE_REACHED error, but I'm sure others on this list
can tell better than I can. Does the IMB segfault point to anything in
particular?
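
One rough way to check that reading is to bucket the frames by name prefix. This is only a sketch: the grouping rules are assumptions based on the function names (for instance, that `smpi_` and `_SMP_` mark the intra-node shared-memory path and that `viutil_` is the completion-queue polling layer), not something confirmed from the MVAPICH source. The helper names `parse_frames` and `classify` are made up for illustration.

```python
import re

# Backtrace as posted above, with the IMB argument lists trimmed.
BACKTRACE = """\
#0 0x000000000044a846 in movdqa8 ()
#1 0x00000000004490b6 in _intel_fast_memcpy.J ()
#2 0x000000000043c03f in smpi_recv_get ()
#3 0x000000000043a03b in smpi_net_lookup ()
#4 0x0000000000439d61 in MPID_SMP_Check_incoming ()
#5 0x000000000042aff8 in viutil_spinandwaitcq ()
#6 0x000000000042a426 in MPID_DeviceCheck ()
#7 0x000000000043334f in MPID_RecvComplete ()
#8 0x00000000004369c5 in MPID_RecvDatatype ()
#9 0x000000000040eefe in PMPI_Recv ()
#10 0x0000000000407880 in IMB_pingpong (...)
#11 0x000000000040636e in IMB_warm_up (...)
#12 0x000000000040393f in main (...)
"""

# Matches "#N 0xADDR in FUNCTION (" at the start of a gdb frame line.
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\S+) \(")


def parse_frames(text):
    """Return a list of (frame_number, function_name) tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def classify(func):
    """Label a frame by name prefix. Heuristic only; the buckets are assumed."""
    lower = func.lower()
    if lower.startswith("smpi_") or "_smp_" in lower:
        return "shared-memory"   # assumed: intra-node SMP channel
    if "memcpy" in lower:
        return "copy"
    if lower.startswith("viutil_"):
        return "cq-polling"      # assumed: InfiniBand completion-queue progress
    if lower.startswith(("mpid_", "pmpi_")):
        return "mpi"
    if lower.startswith("imb_") or func == "main":
        return "benchmark"
    return "other"


frames = parse_frames(BACKTRACE)
for number, func in frames:
    print(f"#{number:<2} {classify(func):<14} {func}")
```

With these rules, frames #2 through #4 land in the shared-memory bucket, directly under the memcpy that faults. If the naming assumption holds, the crash happens while copying an intra-node message rather than while handling an InfiniBand event.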
Thanks,
Nathan
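
For reference, the numeric codes in the abort messages quoted below can be cross-checked against the verbs enums. The sketch below pulls the name, code, source line and file out of Pierrick's log and compares each code with a small hand-copied subset of `enum ibv_event_type` and `enum ibv_wc_status`. The numeric values are as I recall them from libibverbs' `verbs.h` and should be checked against the installed header; the `decode` helper and the dictionary names are hypothetical.

```python
import re

# Hand-copied subsets of the libibverbs enums (assumed values; verify against
# the verbs.h shipped with your OFED install).
EVENT_TYPES = {
    1: "IBV_EVENT_QP_FATAL",
    2: "IBV_EVENT_QP_REQ_ERR",
    3: "IBV_EVENT_QP_ACCESS_ERR",
    16: "IBV_EVENT_QP_LAST_WQE_REACHED",
}
WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
}

# Abort output from Pierrick's message, as wrapped by the mail client.
LOG = """\
[1:chpcc060] Abort: [1:chpcc060] Abort: [chpcc060:1] Got completion
with error IBV_WC_LOC_PROT_ERR, code=4 at line 2374 in file viacheck.c
[1] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line
2554 in file viacheck.c
[0:chpcc058] Abort: [0] Got FATAL event
IBV_EVENT_QP_LAST_WQE_REACHED, code=16
at line 2554 in file viacheck.c
"""

ABORT_RE = re.compile(
    r"(IBV_(EVENT|WC)_[A-Z_]+),\s*code=(\d+)\s+at\s+line\s+(\d+)\s+in\s+file\s+(\S+)"
)


def decode(log):
    """Extract each reported verbs error and check its code against the tables."""
    text = " ".join(log.split())  # undo the mail client's line wrapping
    results = []
    for match in ABORT_RE.finditer(text):
        name, kind = match.group(1), match.group(2)
        code = int(match.group(3))
        table = EVENT_TYPES if kind == "EVENT" else WC_STATUS
        results.append({
            "name": name,
            "code": code,
            "line": int(match.group(4)),
            "file": match.group(5),
            "consistent": table.get(code) == name,
        })
    return results


for entry in decode(LOG):
    print(entry)
```

In that log the IBV_WC_LOC_PROT_ERR completion is reported alongside the IBV_EVENT_QP_LAST_WQE_REACHED events, so the event may be a downstream symptom of the QP having already gone into the error state, which would fit Sayantan's description below. That is an inference from the message order, not a confirmed diagnosis.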
Nathan Dauchy wrote:
> Sayantan,
>
> Thanks for the quick reply!
>
> We were previously running openIB gen2 (r-7.6.07 I think) and MVAPICH
> 0.9.8. In that environment, several benchmarks were run, as well as
> many user codes, and I'm not aware of any problems related to
> QP_LAST_WQE_REACHED.
>
> Since upgrading, we have not yet run many benchmarks. I'm out of the
> office for the next week but will see about running the ones you suggest
> when I get back.
>
> Thanks,
> Nathan
>
>
> Sayantan Sur wrote:
>> Hi Nathan,
>>
>> Nathan Dauchy wrote:
>>> Pierrick, All,
>>>
>>> We recently upgraded to OFED-1.2 (mvapich-0.9.9) and are now getting an
>>> error that looks similar to yours:
>>>
>>> [0:w72] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
>>> code=16 at line 2552 in file viacheck.c
>>>
>>> Did you ever find a solution? (I can't find one in the archives.)
>>>
>>> Can someone explain what the IBV_EVENT_QP_LAST_WQE_REACHED error means?
>>> I can't find any clues in the source, and have been unable to turn up
>>> any relevant docs either.
>>>
>>> Thanks for any help and clues you can offer!
>>>
>>
>> Thanks for reporting the problem. The event
>> IBV_EVENT_QP_LAST_WQE_REACHED means that the QP (internal InfiniBand
>> communication channel) is in an error state and all of its outstanding
>> requests have been consumed. Could it be related to a setup issue? Can
>> you run any other MPI programs, such as the OSU benchmarks, IMB, etc.,
>> on all these nodes?
>>
>> Thanks,
>> Sayantan.
>>
>>> Regards,
>>> Nathan
>>>
>>>
>>> Pierrick Penven, Tue Mar 27 12:33:03 EDT 2007:
>>>
>>>> Dear all,
>>>>
>>>> I am trying to install an ocean model on a cluster based on 64-bit
>>>> bi-dual-core AMD Opterons with InfiniBand, using pathf90 and mvapich
>>>> v0.9.8.
>>>>
>>>> The model runs and scales well on 1 node, but is not able to run
>>>> on several nodes. I have tried to use mvapich v0.9.9.beta, and I
>>>> get the following message:
>>>>
>>>> [1:chpcc060] Abort: [1:chpcc060] Abort: [chpcc060:1] Got completion
>>>> with error IBV_WC_LOC_PROT_ERR, code=4 at line 2374 in file viacheck.c
>>>> [1] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line
>>>> 2554 in file viacheck.c
>>>> [0:chpcc058] Abort: [0] Got FATAL event
>>>> IBV_EVENT_QP_LAST_WQE_REACHED, code=16
>>>> at line 2554 in file viacheck.c
>>>> /CHPC/usr/local/mvapich_099b/bin/mpirun: line 1: 17412
>>>> Terminated /CHPC/usr/local/mvapich_099b/bin/mpirun_rsh
>>>> -np 2 -hostfile /CHPC/home/loadl/execute/chpcln.4269.0.machinefile
>>>> /CHPC/home/ppenven/Roms_tools/TEST1/./roms
>>>>
>>>> The problem does not occur when using MPI over IP rather than VAPI.
>>>>
>>>> Is there a solution to this problem?
>>>>
>>>> Thanks a lot
>>>>
>>>> Pierrick
>>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>
>