[mvapich-discuss] Unexplained crashes reported ibv channel manager

Belgin, Mehmet mehmet.belgin at oit.gatech.edu
Sat Dec 24 10:11:54 EST 2016


Hi Hari,

I tried to identify the remote node under the assumption that MPI ranks are distributed in the same order as the hostlist from the scheduler, but couldn’t find any apparent issues with that node. I’ll create a rank <-> hostname lookup table for future runs to be sure.
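(For reference, the block-distribution assumption above can be sketched roughly as follows. The hostnames and the processes-per-node value are made up for illustration; the real mapping depends on the scheduler and launcher configuration, so treat this only as a starting point, not as authoritative.)

```python
# Sketch: build a rank -> hostname lookup, ASSUMING ranks are assigned
# in hostlist order with block distribution (ppn ranks per node).
# The actual mapping depends on the MPI launcher's rank-placement policy.

def rank_to_host(hostlist, ppn):
    """hostlist: ordered node names from the scheduler; ppn: processes per node."""
    table = {}
    rank = 0
    for host in hostlist:
        for _ in range(ppn):
            table[rank] = host
            rank += 1
    return table

# Hypothetical example: with 16 ranks per node, rank 623 would land on
# the node at index 623 // 16 == 38 under this assumption.
hosts = ["node%03d" % i for i in range(64)]
table = rank_to_host(hosts, 16)
print(table[623])  # node038
```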

We recently upgraded our OFED and kernel versions, and I was mostly worried that the crashes were related to those upgrades, which would make the culprit hard to find. I hope this is just a hardware issue.

Any other suggestions will be greatly appreciated!

Thank you for your reply on a weekend and happy holidays :)

-Memo




From: Hari Subramoni<mailto:subramoni.1 at osu.edu>
Sent: Saturday, December 24, 2016 9:50 AM
To: Belgin, Mehmet<mailto:mehmet.belgin at oit.gatech.edu>
Cc: mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Unexplained crashes reported ibv channel manager

Hi  Mehmet,

Sorry to hear that you are facing issues.

When you see this error, do you see failures at other processes as well? For instance, do you see a failure at rank 623? I ask because we typically see such failures at a process when the remote process has failed for some reason.

Regards,
Hari.


On Dec 23, 2016 10:57 PM, "Belgin, Mehmet" <mehmet.belgin at oit.gatech.edu<mailto:mehmet.belgin at oit.gatech.edu>> wrote:
Hi everyone,

We’ve been trying to troubleshoot crashes in some simulations that happen only for sufficiently large runs (>1024 cores) and only after many hours of runtime (~20 hrs). The researchers claim that their code runs fine on other clusters and that this crash happens only on ours.

Error messages look something like:

[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] Send desc error in msg to 623, wc_opcode=0
[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] Msg from 623: wc.status=12, wc.wr_id=0x591ac20, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[iw-p41-27-r.pace.gatech.edu:mpi_rank_805][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:587: [] Got completion with error 12, vendor code=0x81, dest rank=623
: Inappropriate ioctl for device (25)

I understand that this is coming from the IB layer (we use OFED-3.18-1) and is probably not directly related to mvapich2, but I still wanted to ask for help in case you recognize this error and have suggestions for us. My web search didn’t turn up much. This node (iw-p41-27-r) is not showing any errors, and its IB connection appears to be healthy.
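(If it helps interpret the log: by my reading of the ibv_wc_status enum in the libibverbs headers, status 12 is IBV_WC_RETRY_EXC_ERR, i.e. the transport retry counter was exceeded, which would be consistent with the peer side going away. A tiny lookup for the codes I see most often; please verify the values against your own OFED verbs.h, since this is my transcription, not an authoritative table.)

```python
# Partial lookup of libibverbs ibv_wc_status codes, transcribed from my
# reading of verbs.h -- double-check against your installed OFED headers.
WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",       # work request flushed (QP in error state)
    12: "IBV_WC_RETRY_EXC_ERR",     # transport retry counter exceeded
    13: "IBV_WC_RNR_RETRY_EXC_ERR", # receiver-not-ready retries exceeded
}

print(WC_STATUS.get(12, "unknown"))  # IBV_WC_RETRY_EXC_ERR
```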

We tested these nodes repeatedly with OSU benchmarks and other applications at hand. These crashes happen only for large runs after a certain runtime as I mentioned above.

We would really appreciate any suggestions you may have.

Thank you, and happy holidays!

-Memo (Georgia Tech)


=========================================
Mehmet Belgin, Ph.D.
Scientific Computing Consultant
Partnership for an Advanced Computing Environment (PACE)
Georgia Institute of Technology
258 4th Street NW, Rich Building, #326
Atlanta, GA  30332-0700
Office: (404) 385-0665<tel:(404)%20385-0665>




_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

