[mvapich-discuss] SEGV during application termination

Hari Subramoni subramon at cse.ohio-state.edu
Mon Jul 12 08:02:49 EDT 2010


Hi,

I'm assuming this happens with the SoftiWARP stack - right?

From the stack trace, it looks like the segfault is happening at the write
call in 'rdma_get_cm_event' when rdma_cm calls it to get a new event.
Whatever MVAPICH2 does, it should not be affecting the OFED rdma_cm or
anything in the kernel. Have you had a chance to check out this failure
with the developers of the SoftiWARP stack?
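
For reference, frames #6 to #10 of the trace (sigcancel_handler,
__pthread_unwind, _Unwind_ForcedUnwind) are what glibc normally shows when a
thread blocked in a cancellation point is cancelled with pthread_cancel. The
following is a minimal sketch of that mechanism only, not MVAPICH2 or
librdmacm code. The names blocked_worker and cancel_blocked_thread are
hypothetical, and a pipe read stands in for the blocking call inside
rdma_get_cm_event.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for an event thread: blocks in read() on a pipe
 * that never receives data. read() is a POSIX cancellation point. */
static void *blocked_worker(void *arg)
{
    int fd = *(int *)arg;
    char byte;

    for (;;) {
        if (read(fd, &byte, 1) <= 0)
            break;              /* EOF or error: leave normally */
    }
    return NULL;
}

/* Returns 1 if the worker ended through cancellation, 0 if it returned
 * normally, and -1 if setup, cancel, or join failed. */
int cancel_blocked_thread(void)
{
    int fds[2];
    pthread_t tid;
    void *result = NULL;
    int rc;

    if (pipe(fds) != 0)
        return -1;

    if (pthread_create(&tid, NULL, blocked_worker, &fds[0]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Give the worker time to block. Not needed for correctness: a pending
     * cancel is also acted on when the thread next enters read(). */
    usleep(100000);

    rc = pthread_cancel(tid);              /* request cancellation */
    if (pthread_join(tid, &result) != 0)   /* wait for the unwind to finish */
        rc = -1;

    close(fds[0]);
    close(fds[1]);

    if (rc != 0)
        return -1;
    return result == PTHREAD_CANCELED ? 1 : 0;
}
```

Built with -pthread and called from main(), cancel_blocked_thread() should
return 1 on a working glibc. If a small test like this also crashes in the
unwinder on the affected nodes, that would point at the C run-time rather
than at the MPI or RDMA libraries.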

Thx,
Hari.


On Mon, 12 Jul 2010, TJC Ward wrote:

> Whenever I run an application with mvapich2, I get a SIGSEGV and core dump
> from all nodes during MPI_Finalize. The stack backtrace is:
> Core was generated by `/opt/mvapich/bin/mpiBench_Allreduce -i 1'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x0fb0d968 in _Unwind_IteratePhdrCallback () from /lib/libc.so.6
> (gdb) where
> #0  0x0fb0d968 in _Unwind_IteratePhdrCallback () from /lib/libc.so.6
> #1  0x0fb0b934 in dl_iterate_phdr () from /lib/libc.so.6
> #2  0x0fb0e594 in _Unwind_Find_FDE () from /lib/libc.so.6
> #3  0x0f81c194 in ?? () from /lib/libgcc_s.so.1
> #4  0x0f81deec in ?? () from /lib/libgcc_s.so.1
> #5  0x0f81e4c0 in _Unwind_ForcedUnwind () from /lib/libgcc_s.so.1
> #6  0x0fbb1030 in _Unwind_ForcedUnwind () from /lib/libpthread.so.0
> #7  0x0fbadfec in __pthread_unwind () from /lib/libpthread.so.0
> #8  0x0fba4e1c in sigcancel_handler () from /lib/libpthread.so.0
> #9  <signal handler called>
> #10 0x0fbaeb44 in write () from /lib/libpthread.so.0
> #11 0x0ff139c4 in rdma_get_cm_event (channel=0x1017f590, event=0xb7fbee38) at src/cma.c:1304
> #12 0x0fd29204 in cm_thread () from /opt/mvapich/lib/libmpich.so.1.2
> #13 0x0fba5c10 in start_thread () from /lib/libpthread.so.0
> #14 0x0face344 in clone () from /lib/libc.so.6
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> (gdb)
>
> I can see that this is the RDMA CM thread; presumably something is sending
> it a signal to bring rdma_get_cm_event out of the kernel as part of
> termination processing. But I don't know what is sending the signal (is it
> the device driver for the hardware, or is it mvapich?), and I don't know
> whether the mishandling of the signal is something that should be fixed in
> mvapich, or whether it's a problem with my 'glibc' run-time.
>
> I get this behaviour with the new mvapich2 1.5, and also with the
> previous versions of mvapich2 that I have to hand.
>
> If anyone else sees this behaviour, or knows what's causing it, please let
> me know.
>
>
> T J (Chris) Ward, IBM Research
> Scalable Data-Centric Computing - Active Storage Fabrics - IBM System
> BlueGene
> IBM United Kingdom Ltd., Hursley Park, Winchester, Hants, SO21 2JN
> 011-44-1962-818679
> IBM Intranet http://hurgsa.ibm.com/~tjcw/
>


