[mvapich-discuss] Unexplained crashes reported ibv channel manager

Belgin, Mehmet mehmet.belgin at oit.gatech.edu
Fri Dec 23 12:27:00 EST 2016


Hi everyone,

We’ve been trying to troubleshoot crashes in some simulations that occur only in sufficiently large runs (>1024 cores) and only after many hours of runtime (~20 hours). The researcher claims that the code works fine on other clusters and that this crash happens only on ours.

Error messages look something like:

[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] Send desc error in msg to 623, wc_opcode=0
[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] Msg from 623: wc.status=12, wc.wr_id=0x591ac20, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 12, vendor code=0x81, dest rank=623
: Inappropriate ioctl for device (25)
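
For anyone reading the log above, here is a minimal sketch of how the numeric completion status could be decoded into a name. It assumes the number printed as "error 12" is the raw `enum ibv_wc_status` value from libibverbs' `<infiniband/verbs.h>`, in which case 12 would be `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded). The function name `decode_completion_error`, the regular expression, and the returned field names are hypothetical and written only for this illustration; the lookup table should be checked against the headers of the installed OFED release.

```python
import re

# Names listed in the order of enum ibv_wc_status in <infiniband/verbs.h>.
# Assumption: the installed OFED headers use the same numbering.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",            # 0
    "IBV_WC_LOC_LEN_ERR",        # 1
    "IBV_WC_LOC_QP_OP_ERR",      # 2
    "IBV_WC_LOC_EEC_OP_ERR",     # 3
    "IBV_WC_LOC_PROT_ERR",       # 4
    "IBV_WC_WR_FLUSH_ERR",       # 5
    "IBV_WC_MW_BIND_ERR",        # 6
    "IBV_WC_BAD_RESP_ERR",       # 7
    "IBV_WC_LOC_ACCESS_ERR",     # 8
    "IBV_WC_REM_INV_REQ_ERR",    # 9
    "IBV_WC_REM_ACCESS_ERR",     # 10
    "IBV_WC_REM_OP_ERR",         # 11
    "IBV_WC_RETRY_EXC_ERR",      # 12
    "IBV_WC_RNR_RETRY_EXC_ERR",  # 13
]

# Hypothetical pattern matching the "Got completion with error" line format
# shown in the log above.
LINE_RE = re.compile(
    r"mpi_rank_(\d+)\].*Got completion with error (\d+), "
    r"vendor code=(0x[0-9a-fA-F]+), dest rank=(\d+)"
)


def decode_completion_error(line):
    """Return a dict describing the failed completion, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    src, status, vendor, dest = m.groups()
    status = int(status)
    # Fall back to a placeholder for values beyond the table above.
    name = IBV_WC_STATUS[status] if status < len(IBV_WC_STATUS) else "UNKNOWN"
    return {
        "src_rank": int(src),
        "dest_rank": int(dest),
        "status": status,
        "status_name": name,
        "vendor_code": int(vendor, 16),
    }
```

Running this over the logs of several failed jobs and tallying the (source rank, destination rank) pairs might show whether the failures cluster on particular nodes or links.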

I understand that this is coming from the IB layer (we use OFED-3.18-1) and is probably not directly related to mvapich2, but I still wanted to ask for help in case you recognize this error and have suggestions for us. My web search didn’t return much. This node (iw-p41-27-r) is not showing any errors and its IB connection appears to be healthy.

We tested these nodes repeatedly with OSU benchmarks and other applications at hand. These crashes happen only for large runs after a certain runtime as I mentioned above.

We would really appreciate any suggestions you may have.

Thank you, and happy holidays!

-Memo (Georgia Tech)


=========================================
Mehmet Belgin, Ph.D.
Scientific Computing Consultant
Partnership for an Advanced Computing Environment (PACE)
Georgia Institute of Technology
258 4th Street NW, Rich Building, #326
Atlanta, GA  30332-0700
Office: (404) 385-0665


