[mvapich-discuss] Unexplained crashes reported ibv channel manager

Hari Subramoni subramoni.1 at osu.edu
Sat Dec 24 09:50:12 EST 2016


Hi Mehmet,

Sorry to hear that you are facing issues.

When you see this error, do you also see failures at other processes? For
instance, do you see a failure at rank 623? I ask because we see such
failures at a process when the remote process fails for some reason.
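
One way to check this across a large job log is sketched below. It is only a
rough sketch, assuming the job's stdout/stderr use the same
"[host:mpi_rank_N][handle_cqe] Send desc error in msg to M" format as the
messages quoted further down; the function names are made up for
illustration and are not part of MVAPICH2. It lists every peer rank named in
a send error and says whether that peer printed any tagged message of its own.

```python
import re
from collections import defaultdict

# Patterns follow the log format quoted in this thread; \s+ tolerates the
# line wrapping that mail clients or terminals may add.
RANK_TAG = re.compile(r"\[([^\s:\]]+):mpi_rank_(\d+)\]")
SEND_ERR = re.compile(
    r"\[([^\s:\]]+):mpi_rank_(\d+)\]\[handle_cqe\]\s+"
    r"Send\s+desc\s+error\s+in\s+msg\s+to\s+(\d+)"
)


def failed_peers(log_text):
    """Map each peer rank named in a send error to the (host, rank) pairs that reported it."""
    peers = defaultdict(set)
    for host, reporter, peer in SEND_ERR.findall(log_text):
        peers[int(peer)].add((host, int(reporter)))
    return dict(peers)


def ranks_that_logged(log_text):
    """Return every rank that printed at least one tagged message of its own."""
    return {int(rank) for _host, rank in RANK_TAG.findall(log_text)}


def summarize(log_text):
    """Return one line per peer, saying whether that peer logged anything itself."""
    logged = ranks_that_logged(log_text)
    lines = []
    for peer, reporters in sorted(failed_peers(log_text).items()):
        state = "logged its own messages" if peer in logged else "logged nothing"
        who = ", ".join(f"{rank}@{host}" for host, rank in sorted(reporters))
        lines.append(f"rank {peer}: reported by {who}; {state}")
    return lines
```

Feeding the whole job output to summarize() gives a short list of ranks to
look at first. A peer that "logged nothing" is not proof of anything; it only
means its node's system logs are worth checking next.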

Regards,
Hari.


On Dec 23, 2016 10:57 PM, "Belgin, Mehmet" <mehmet.belgin at oit.gatech.edu>
wrote:

Hi everyone,

We’ve been trying to troubleshoot crashes in some simulations that only
happen for sufficiently large runs (>1024 cores) and after many hours of
runtime (~20hrs). The researchers claim that their code works just fine on
other clusters and that this crash only happens on ours.

Error messages look something like:

[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] Send desc error in msg to 623, wc_opcode=0
[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] Msg from 623: wc.status=12, wc.wr_id=0x591ac20, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 12, vendor code=0x81, dest rank=623
: Inappropriate ioctl for device (25)
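
For reference, the numeric wc.status in these messages is a value of
enum ibv_wc_status from libibverbs, where 12 is IBV_WC_RETRY_EXC_ERR
(transport retry counter exceeded). The small sketch below decodes that field
from a log line; the table is copied from the first entries of that enum, and
the helper name decode_status is made up for illustration.

```python
import re

# First entries of enum ibv_wc_status in libibverbs <infiniband/verbs.h>.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

STATUS_FIELD = re.compile(r"wc\.status=(\d+)")


def decode_status(line):
    """Return (code, name) for the wc.status field in a log line, or None if absent."""
    match = STATUS_FIELD.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    # Codes beyond this partial table are reported as unknown rather than guessed.
    return code, IBV_WC_STATUS.get(code, "unknown (not in this table)")
```

A retry-exceeded status means the sender stopped getting acknowledgements
from the other side, so the peer rank and the path to it are usually where to
look rather than the node that printed the message.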

I understand that this is coming from the IB layer (we use OFED-3.18-1) and
probably not directly related to mvapich2, but I still wanted to ask for
help in case you recognize this error and have suggestions for us. My
web search didn’t return much. This node (iw-p41-27-r) is not showing any
errors and its IB connection appears to be healthy.

We tested these nodes repeatedly with OSU benchmarks and other applications
at hand. These crashes happen only for large runs after a certain runtime
as I mentioned above.

We would really appreciate any suggestions you may have.

Thank you, and happy holidays!

-Memo (Georgia Tech)


=========================================
Mehmet Belgin, Ph.D.
Scientific Computing Consultant
Partnership for an Advanced Computing Environment (PACE)
Georgia Institute of Technology
258 4th Street NW, Rich Building, #326
Atlanta, GA  30332-0700
Office: (404) 385-0665



