[mvapich-discuss] Deadlock with CUDA and InfiniBand

Hari Subramoni subramoni.1 at osu.edu
Thu Sep 11 14:35:53 EDT 2014


Hi Freddie,

We have not seen this error before. Our internal testing environment uses
regular OFED (OFED-1.5.3.2), not Intel OFED, and it runs fine with PSM
(QLogicIB-Basic.RHEL6-x86_64.7.0.1.0.43). So the problem could be a
conflict between Intel's version of OFED and the PSM libraries.
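
If it helps narrow things down, a few commands can show which OFED and PSM
pieces are actually installed on a node (a rough sketch -- the MVAPICH2
install path and library name below are placeholders for your own build):

  ofed_info -s                                   # installed OFED release string
  rpm -qa | grep -i -e psm -e infinipath         # PSM-related packages, if any
  ldd /opt/mvapich2/lib/libmpich.so | grep -i psm  # PSM library the MPI build links against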

Regards,
Hari.

On Thu, Sep 11, 2014 at 10:51 AM, Witherden, Freddie <
freddie.witherden08 at imperial.ac.uk> wrote:

> Hi Hari,
>
> > Thanks for the details. I understand the issue now. I do not think
> > QLogic HCAs have proper support for the RDMA fast path feature in
> > MVAPICH2. This could be the reason why you saw the hang with that
> > feature enabled. And yes - for QLogic HCAs you should be building
> > MVAPICH2 with ch3:psm for best performance and functionality.
>
> Thank you for suggesting PSM.  I installed the Intel OFED stack on the
> cluster and recompiled MVAPICH2 with PSM support.  Unfortunately, when
> running my application I get errors along the lines of:
>
>   compute-0-1.local.20860Unexpected error in writev(): Invalid argument
> (errno=22) (fd=7,iovec=0x7fffe52f4640,len=3) (err=23)
>
> on the nodes.  Switching back to the ch3:mrail build (with the fast path
> disabled, but on the new Intel stack) works, however.  Hence, I believe
> that the stack is functioning correctly -- just that there is something
> awry with PSM.
>
> Regards, Freddie.
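
For completeness, the two builds discussed above correspond roughly to the
following configure lines (a sketch -- the install prefixes, host names, and
application name are placeholders and should match your own setup):

  # PSM (ch3:psm) build, recommended for QLogic HCAs
  ./configure --prefix=/opt/mvapich2-psm --with-device=ch3:psm
  make -j8 && make install

  # OFA/Gen2 (ch3:mrail) build, the fallback described above
  ./configure --prefix=/opt/mvapich2-mrail --with-device=ch3:mrail --with-rdma=gen2
  make -j8 && make install

  # Disabling the RDMA fast path at run time for the ch3:mrail build
  mpirun_rsh -np 2 node1 node2 MV2_USE_RDMA_FAST_PATH=0 ./app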