[mvapich-discuss] help: Poll CQ failed!

Jeff Haferman jeff at haferman.com
Thu Dec 31 14:48:47 EST 2009


I upgraded the firmware on our HCAs today to the latest available on the
Mellanox website (2.6.0) but am still having the RDMA problems.
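For the record, the firmware level the driver actually loaded can be
checked from the node itself after reflashing and reloading the driver
(a hedged checklist; it assumes the standard OFED diagnostic utilities
are installed on the nodes):

```shell
# Confirm which firmware the driver actually picked up after the flash
# (standard OFED / infiniband-diags tools, assumed installed):
ibstat | grep -i 'firmware'     # should now report 2.6.0
ibv_devinfo | grep 'fw_ver'     # verbs-level view of the same field
```

Note that new HCA firmware only takes effect after a driver reload or a
reboot, so it is worth confirming the running version rather than
trusting the flash tool's output.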

So this is clearly not an MVAPICH issue. I'm going to work with Sun,
Mellanox, and the linux-rdma mailing list to see if I can get to the
bottom of it, and I'll try to report back here once I resolve the
issue.  Thanks for all the help.

Jeff


Jeff Haferman wrote:
> 
> DK -
> Thanks, this is helpful.
> 
> I tried ib_rdma_lat between 2 nodes and received
> Conflicting CPU frequency values detected: 2336.000000 != 2003.000000
> Latency typical: inf usec
> Latency best   : inf usec
> Latency worst  : inf usec
> 
> I tried rping between 2 nodes using
> rping -S 100 -d -v -c -a compute-ib-1-0
> on the client side and received
> verbose
> client
> created cm_id 0x14490c70
> cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x14490c70 (parent)
> cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x14490c70 (parent)
> rdma_resolve_addr - rdma_resolve_route successful
> created pd 0x144933d0
> created channel 0x144933f0
> created cq 0x14493410
> created qp 0x14493550
> rping_setup_buffers called on cb 0x1448e010
> allocated & registered buffers...
> cq_thread started.
> cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x14490c70 (parent)
> ESTABLISHED
> rmda_connect successful
> RDMA addr 14493a90 rkey e002801 len 100
> send completion
> cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x14490c70 (parent)
> client DISCONNECT EVENT...
> wait for RDMA_WRITE_ADV state 6
> poll error -2
> rping_free_buffers called on cb 0x1448e010
> destroy cm_id 0x14490c70
> 
> I did a little bit of searching and found a similar message on
> openfabrics.org suggesting that we might need to upgrade the firmware
> on our switch, so I will look into that.  Any other ideas would be
> appreciated.
> 
> Jeff
> 
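A note on the "Conflicting CPU frequency values" warning in the quoted
ib_rdma_lat run: the perftest tools time with the CPU cycle counter and
divide the measured cycle delta by the clock rate read from
/proc/cpuinfo, so when cores report different clocks (2336 vs. 2003 MHz
is typical of the "ondemand" cpufreq governor rescaling mid-run) no
single divisor is valid and the latency is printed as "inf". A
simplified sketch of the conversion, not perftest's actual code, with a
hypothetical cycle count:

```shell
# Simplified sketch (not perftest's actual code) of cycle-to-latency
# conversion: usec = cycle delta / CPU clock in MHz. When two cores
# disagree on the clock, this divisor is undefined, hence "inf".
cycles=4672        # hypothetical round-trip cycle delta
mhz=2336           # clock reported by /proc/cpuinfo on this core
echo "$((cycles / mhz)) usec"
```

Pinning the governor to "performance" on both nodes before the test
(e.g. with cpufreq-set -g performance from cpufrequtils, run for each
core as root) usually makes the warning disappear and yields real
latency numbers.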


