[Mvapich-discuss] CQ error and exit()

Lana Deere lana.deere at gmail.com
Mon Sep 27 14:36:45 EDT 2021


I got the following errors in an MPI process.

Error getting event!
[node3:mpi_rank_1][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1307: Got FATAL event CQ error on CQ (nil)
: Interrupted system call (4)
*** Error in `program': double free or corruption (!prev): 0x000000000b8f3e40 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81489)[0x2b689c944489]
/lib64/libc.so.6(+0x39b69)[0x2b689c8fcb69]
/lib64/libc.so.6(+0x39bb7)[0x2b689c8fcbb7]
.../third_party/lib/libmpi.so.12(async_thread+0x29c)[0x2b68ac651e4c]
/lib64/libpthread.so.0(+0x7dd5)[0x2b689ae6edd5]
/lib64/libc.so.6(clone+0x6d)[0x2b689c9c0ead]
======= Memory map: ========
[skipped]

The rest of the MPI processes just hung at that point.

If the "interrupted system call" is part of the same error report as the CQ
error, perhaps a check for EINTR with a retry is needed somewhere?
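
To make the suggestion concrete, here is a minimal sketch of an EINTR retry
loop. It is not MVAPICH2 code, and I have not established which call in
async_thread() was actually interrupted. retry_on_eintr() and flaky_call()
are hypothetical names; the only assumption is a blocking call that returns
0 on success, or nonzero with errno set on failure.

```c
#include <errno.h>

/* Hypothetical callback type standing in for whatever blocking call
 * was interrupted: returns 0 on success, nonzero with errno set. */
typedef int (*blocking_call_fn)(void *arg);

/* Repeat the call for as long as it fails only because a signal
 * interrupted it. Any other failure is returned to the caller. */
int retry_on_eintr(blocking_call_fn call, void *arg)
{
    int rc;
    do {
        rc = call(arg);
    } while (rc != 0 && errno == EINTR);
    return rc;
}

/* Stand-in used only for illustration: fails with EINTR until the
 * counter pointed to by arg reaches zero, then succeeds. */
int flaky_call(void *arg)
{
    int *interruptions_left = arg;
    if (*interruptions_left > 0) {
        (*interruptions_left)--;
        errno = EINTR;
        return -1;
    }
    return 0;
}
```

With a counter of 3, retry_on_eintr(flaky_call, &counter) calls the stand-in
four times and returns 0. A failure with any other errno value ends the loop
at once and is passed back unchanged.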

I noticed in ibv_channel_manager.c / async_thread() that it reports the CQ
error using ibv_ca_error_abort(), which in turn prints the error message and
then calls exit().  If the program is multithreaded (ours is), exit() runs
things like static destructors while some of the threads are still using
that data.  This causes cascading errors that make it difficult to figure
out the original error.  I think it would be better, if it has to exit, to
use _exit() instead.  Even better would be to cache the error and make the
next MPI API call return an error, so that the application could clean
itself up (including the other processes).
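
A rough sketch of the "cache the error" idea, assuming C11 atomics are
available. Every name here (latched_error, record_fatal_error,
check_fatal_error, example_api_call) is hypothetical and does not exist in
MVAPICH2. The async thread would record the error instead of exiting, and
each API entry point would check for it first.

```c
#include <stdatomic.h>

/* Hypothetical: 0 means no fatal error has been recorded yet. */
static atomic_int latched_error = 0;

/* Would be called from the async thread in place of exit().
 * Only the first error is kept, so later cascading failures
 * cannot overwrite the original cause. */
void record_fatal_error(int code)
{
    int expected = 0;
    atomic_compare_exchange_strong(&latched_error, &expected, code);
}

/* Returns the recorded error code, or 0 if there is none. */
int check_fatal_error(void)
{
    return atomic_load(&latched_error);
}

/* Hypothetical API entry point: fail fast if a fatal error
 * was recorded asynchronously, otherwise do the normal work. */
int example_api_call(void)
{
    int err = check_fatal_error();
    if (err != 0)
        return err;
    /* ... normal work would go here ... */
    return 0;
}
```

Keeping only the first recorded code is a deliberate choice in this sketch,
since the first error is the one that is hardest to recover once things
start cascading. It does not wake a thread that is already blocked inside an
API call, so a real implementation would need something more for that case.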

This was on 2.3.5, but 2.3.6 looks the same in these respects.

.. Lana (lana.deere at gmail.com)