[mvapich-discuss] ibv_channel_manager error message

Martin Pokorny mpokorny at nrao.edu
Thu May 5 15:16:12 EDT 2016


On 05/05/2016 11:22 AM, Hari Subramoni wrote:
> Was this the first error that was observed or were there any other
> failures before this? For instance, could you please let us know if the
> destination rank (looks like it was 7 in this case) failed for some
> reason like a segfault / assertion?

From what I can tell, it was the first error in this instance. However,
the system on which the error occurred doesn't run interactively, and
logging is somewhat unreliable, especially when processes crash, so I
can't be certain that it was actually the first error. In one other
case, I can confirm that there was a segfault in the destination rank,
possibly before a message like the one I provided was written. If a
segfault in the destination rank can trigger such an error, that's a
useful bit of information for me.
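
Since the logs have to be checked after the fact, a small script can help
pull these messages out and show which sending and destination ranks are
involved. The following is only a minimal sketch in Python. It assumes the
message format is exactly the one quoted below, that each message sits on a
single line in the log, and that the number after "error" is the verbs
`ibv_wc_status` value (only a subset of that enum is listed). The names
`CQE_ERROR`, `WC_STATUS` and `parse_cqe_errors` are hypothetical, made up
for this illustration.

```python
import re

# Matches the handle_cqe message format quoted in this thread.
# Assumption: the whole message appears on one log line.
CQE_ERROR = re.compile(
    r"\[(?P<host>[^:\]]+):mpi_rank_(?P<rank>\d+)\]\[handle_cqe\]"
    r".*?Got completion with error (?P<status>\d+), "
    r"vendor code=(?P<vendor>0x[0-9a-fA-F]+), dest rank=(?P<dest>\d+)"
)

# Subset of enum ibv_wc_status from <infiniband/verbs.h>.
# Assumption: the printed error number is this status value.
WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def parse_cqe_errors(lines):
    """Return one dict per completion-error message found in `lines`."""
    found = []
    for line in lines:
        m = CQE_ERROR.search(line)
        if m is None:
            continue
        status = int(m.group("status"))
        found.append({
            "host": m.group("host"),
            "rank": int(m.group("rank")),         # rank that logged the error
            "dest": int(m.group("dest")),         # destination rank
            "status": status,
            "status_name": WC_STATUS.get(status, "unknown"),
            "vendor": int(m.group("vendor"), 16),
        })
    return found
```

Passing an open log file (or any iterable of lines) to `parse_cqe_errors`
gives a list that can be compared against whatever the destination rank
managed to log before it stopped.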

> Could you send us the output of mpiname -a?

$ mpiname -a
MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -g -O2
CXX: g++   -DNDEBUG -DNVALGRIND -g -O2
F77: gfortran -L/lib -L/lib   -g -O2
FC: gfortran   -g -O2

Configuration
--enable-romio --with-file-system=lustre --enable-debuginfo 
--enable-g=dbg,log --with-limic2 --enable-rdma-cm

> If you've a debug build, can you rerun it with
> MV2_DEBUG_SHOW_BACKTRACE=2 and send us the backtrace?

That is already the case, but, as I mentioned, capturing a backtrace in
the log is unreliable. So far I haven't got one for this error.



> On Thu, May 5, 2016 at 1:06 PM, Martin Pokorny <mpokorny at nrao.edu
> <mailto:mpokorny at nrao.edu>> wrote:
>
>     I have on occasion been seeing error messages like the following:
>
>         [cbe-node-29:mpi_rank_32][handle_cqe]
>         ../src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587:
>         [] Got completion with error 4, vendor code=0x54, dest rank=7
>
>
>     There's no message written by the receiving rank at the time the
>     sending rank wrote this message. Can anyone shed any light on what
>     the underlying cause might be? I know that I've not provided much
>     information, but I'm happy to provide more if it would be helpful.
>     I'm using mvapich2-2.1 on RHEL 6.3.
>
>     --
>     Martin


-- 
Martin Pokorny
Software Engineer - Jansky Very Large Array and CASA
National Radio Astronomy Observatory - New Mexico

