[mvapich-discuss] ibv_channel_manager error message

Hari Subramoni subramoni.1 at osu.edu
Thu May 5 15:20:10 EDT 2016


Hi Martin,

This was most likely caused by a failure of the remote process. MVAPICH2 uses
the reliable, connection-oriented RC transport by default, so when a remote
process fails and its IB connections break, the surviving process reports a
failure in its IB operations.
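
Since the message names the destination rank, one way to triage a large log
is to extract that rank and check its node first for a crash. Here is a
minimal sketch of that, not part of MVAPICH2. The regular expression is an
assumption inferred from the single message quoted below, and the name
parse_cqe_errors is made up for illustration. Other MVAPICH2 versions may
format the line differently.

```python
import re

# Pattern inferred from the one message quoted in this thread (assumption);
# it is not taken from the MVAPICH2 sources.
CQE_ERROR = re.compile(
    r"\[(?P<host>[^:\]]+):mpi_rank_(?P<rank>\d+)\]\[handle_cqe\]"
    r".*?Got completion with error (?P<status>\d+), "
    r"vendor code=(?P<vendor>0x[0-9a-fA-F]+), dest rank=(?P<dest>\d+)",
    re.DOTALL,  # the message may be wrapped across several log lines
)


def parse_cqe_errors(log_text):
    """Return one dict per handle_cqe error message found in log_text."""
    results = []
    for m in CQE_ERROR.finditer(log_text):
        results.append({
            "host": m.group("host"),
            "rank": int(m.group("rank")),
            "status": int(m.group("status")),
            "vendor_code": int(m.group("vendor"), 16),
            "dest_rank": int(m.group("dest")),
        })
    return results


sample = (
    "[cbe-node-29:mpi_rank_32][handle_cqe] "
    "../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: "
    "[] Got completion with error 4, vendor code=0x54, dest rank=7"
)

for err in parse_cqe_errors(sample):
    # The reporting rank is the survivor; the destination rank is the suspect.
    print(f"rank {err['rank']} on {err['host']} reported an error "
          f"towards rank {err['dest_rank']}; check that rank first")
```

Running this over the combined job log gives a list of (reporting rank,
destination rank) pairs. A destination rank that shows up repeatedly is a
reasonable first place to look for a segfault or assertion.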

Thx,
Hari.

On Thu, May 5, 2016 at 3:16 PM, Martin Pokorny <mpokorny at nrao.edu> wrote:

> On 05/05/2016 11:22 AM, Hari Subramoni wrote:
>
>> Was this the first error that was observed or were there any other
>> failures before this? For instance, could you please let us know if the
>> destination rank (looks like it was 7 in this case) failed for some
>> reason like a segfault / assertion?
>>
>
> From what I can tell, it was the first error in this instance. However, the
> system on which the error occurred doesn't run interactively, and logging
> is somewhat unreliable, especially when processes crash, so I can't be
> certain that it was actually the first error. In one other case, I can
> confirm that there was a segfault in the destination rank, possibly prior
> to a message like the one I provided. If a segfault in the destination rank
> can trigger such an error, that's a useful bit of information for me.
>
>> Could you send us the output of mpiname -a?
>>
>
> $ mpiname -a
> MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail
>
> Compilation
> CC: gcc    -DNDEBUG -DNVALGRIND -g -O2
> CXX: g++   -DNDEBUG -DNVALGRIND -g -O2
> F77: gfortran -L/lib -L/lib   -g -O2
> FC: gfortran   -g -O2
>
> Configuration
> --enable-romio --with-file-system=lustre --enable-debuginfo
> --enable-g=dbg,log --with-limic2 --enable-rdma-cm
>
>> If you have a debug build, can you rerun it with
>> MV2_DEBUG_SHOW_BACKTRACE=2 and send us the backtrace?
>>
>
> That is already the case, but, as I mentioned, capturing a backtrace in
> the log is unreliable. I haven't got one for this error, so far.
>
>
>
>> On Thu, May 5, 2016 at 1:06 PM, Martin Pokorny <mpokorny at nrao.edu> wrote:
>>
>>     I have on occasion been seeing error messages like the following:
>>
>>         [cbe-node-29:mpi_rank_32][handle_cqe]
>>         ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587:
>>         [] Got completion with error 4, vendor code=0x54, dest rank=7
>>
>>
>>     There's no message written by the receiving rank at the time the
>>     sending rank wrote this message. Can anyone shed any light on what
>>     the underlying cause might be? I know that I've not provided much
>>     information, but I'm happy to provide more if it would be helpful.
>>     I'm using mvapich2-2.1 on RHEL 6.3.
>>
>>     --
>>     Martin
>>     _______________________________________________
>>     mvapich-discuss mailing list
>>     mvapich-discuss at cse.ohio-state.edu
>>     http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>>
>
> --
> Martin Pokorny
> Software Engineer - Jansky Very Large Array and CASA
> National Radio Astronomy Observatory - New Mexico
>