[mvapich-discuss] BUG in MVAPICH2-1.2p1 - OFA (RDMA) inside vbuf.c file while calling deallocate_vbufs()

Karthik Gopalakrishnan gopalakk at cse.ohio-state.edu
Wed Aug 12 00:03:18 EDT 2009


Hi Polk.

Thanks for the patch. We will test and apply this to the final release
of MVAPICH2-1.4.

I agree that it is good programming practice to release any locks you
have acquired before you exit. However, it should not make a
difference even if we exit from the library without releasing the
spinlock protecting the vbuf list head, which is internal to the
MVAPICH2 stack. The MVAPICH2-1.2 library runs in user space, so it
*should* not result in a kernel panic. If it does, it means that some
bug in the IB core / driver code has been exposed, and it should be
fixed there.

Can you please tell me what the function call stack looks like during
the panic? Does this happen consistently? Does your patch fix the
issue?

Regards,
Karthik

On Tue, Aug 11, 2009 at 1:18 AM, gossips J<polk678 at gmail.com> wrote:
> Hi,
> It is observed that deallocate_vbufs() has error handling for the
> ibv_dereg_mr() API.
> If that call fails, mvapich2 calls ibv_error_abort().
> However, before any of this happens, a spin lock has already been
> acquired for the vbufs:
> ++++
> pthread_spin_lock(&vbuf_lock);
> ++++
> So ideally, this spin lock should be released before
> ibv_error_abort() is called.
> If this is not done and MR deregistration fails, the OS gives a kernel
> panic since the spin lock has not been released.
> This seems to be a bug in mvapich2-1.2p1-1.src.rpm, which comes with
> OFED-1.4.1-GA.
> The following patch should fix this:
> ++++++++
> --- src/mpid/ch3/channels/mrail/src/gen2/vbuf.c
> +++ src/mpid/ch3/channels/mrail/src/gen2/vbuf_fixed.c
> @@ -105,6 +105,7 @@ int init_vbuf_lock()
>  void deallocate_vbufs(int hca_num)
>  {
>      vbuf_region *r = vbuf_region_head;
> +    int err = 0;
>  #if !defined(CKPT)
>      if (MPIDI_CH3I_RDMA_Process.has_srq
> @@ -122,7 +123,8 @@ void deallocate_vbufs(int hca_num)
>          if (r->mem_handle[hca_num] != NULL
>              && ibv_dereg_mr(r->mem_handle[hca_num]))
>          {
> -            ibv_error_abort(IBV_RETURN_ERR, "could not deregister MR");
> +            err = -1;
> +           break;
>          }
>          DEBUG_PRINT("deregister vbufs\n");
> @@ -139,6 +141,9 @@ void deallocate_vbufs(int hca_num)
>      {
>           pthread_spin_unlock(&vbuf_lock);
>      }
> +
> +    if (err < 0)
> +       ibv_error_abort(IBV_RETURN_ERR, "could not deregister MR");
>  }
>  static int allocate_vbuf_region(int nvbufs)
> ++++++++
> Thanks,
> Polk


