[mvapich-discuss] Segmentation fault while running application

Hari Subramoni subramoni.1 at osu.edu
Mon Aug 18 07:18:41 EDT 2014


Hi Chaitra,

Thanks for the report. We will take a look at this and get back to you
shortly.

Thx, Hari.

On Monday, August 18, 2014, Chaitra Kumar <chaitragkumar at gmail.com> wrote:

> Hi Hari and Team,
>
> To rule out any issues with existing libraries, we reimaged the OS and
> re-installed all RDMA-related drivers. The problem still persists.
>
> I also ran the application using valgrind.
>
> The command I used was:
>
> mpirun_rsh -np 72 -hostfile hostfile  MV2_DEBUG_CORESIZE=unlimited
> MV2_DEBUG_SHOW_BACKTRACE=1  MV2_ENABLE_AFFINITY=0 valgrind --tool=memcheck
> --leak-check=full --track-origins=yes --show-reachable=yes
> ./graph500_mpi_custom_72 28
>
> It generated 72 core files.
>
>
> Below is the backtrace generated for one of the processes:
>
> #0  0x000000321f80f5db in raise () from /lib64/libpthread.so.0
> #1  <signal handler called>
> #2  0x000000000574b6ea in MPL_trfree ()
>    from /home/padmanac/mvapich-gdb/lib/libmpl.so.1
> #3  0x0000000004ed747c in MPIU_trfree (a_ptr=0xf0e0d0c9, line=4139,
>     fname=0x5292bdc "src/mpi/coll/ch3_shmem_coll.c") at
> src/util/mem/trmem.c:37
> #4  0x0000000005032868 in mv2_shm_coll_cleanup (shmem=0x1621cc28)
>     at src/mpi/coll/ch3_shmem_coll.c:4139
> #5  0x000000000517729e in free_2level_comm (comm_ptr=0x10697c18)
>     at src/mpi/comm/create_2level_comm.c:144
> #6  0x0000000004eca438 in MPIR_Comm_delete_internal (comm_ptr=0x10697c18,
>     isDisconnect=0) at src/mpi/comm/commutil.c:1918
> #7  0x000000000516a68e in MPIR_Comm_release (comm_ptr=0x10697c18,
>     isDisconnect=0) at ./src/include/mpiimpl.h:1331
> #8  0x000000000516a9f7 in PMPI_Comm_free (comm=0x7feffebb4)
>     at src/mpi/comm/comm_free.c:124
> #9  0x0000000000408f8d in scatter_bitmap_set::~scatter_bitmap_set (
>     this=0x7feffeba0, __in_chrg=<value optimized out>) at onesided.hpp:271
> #10 0x0000000000406b7c in validate_bfs_result (tg=0x7fefff1a0,
>     nglobalverts=268435456, nlocalverts=4194304, root=31958113,
>     pred=0xa73a10a8, edge_visit_count_ptr=0x7fefff118) at validate.cpp:449
> #11 0x0000000000403737 in main (argc=2, argv=0x7fefff498) at main.cpp:381
>
> Valgrind generated the following error log:
> [polaris-1:mpi_rank_36][error_sighandler] Caught error: Segmentation fault (signal 11)
> ==131765== Invalid read of size 8
> ==131765==    at 0x574B6EA: MPL_trfree (in /home/padmanac/mvapich-gdb/lib/libmpl.so.1.0.0)
> ==131765==    by 0x4ED747B: MPIU_trfree (trmem.c:37)
> ==131765==    by 0x5032867: mv2_shm_coll_cleanup (ch3_shmem_coll.c:4139)
> ==131765==    by 0x517729D: free_2level_comm (create_2level_comm.c:144)
> ==131765==    by 0x4ECA437: MPIR_Comm_delete_internal (commutil.c:1918)
> ==131765==    by 0x516A68D: MPIR_Comm_release.clone.0 (mpiimpl.h:1331)
> ==131765==    by 0x516A9F6: PMPI_Comm_free (comm_free.c:124)
> ==131765==    by 0x408F8C: scatter_bitmap_set::~scatter_bitmap_set() (onesided.hpp:271)
> ==131765==    by 0x406B7B: validate_bfs_result(tuple_graph const*, long, unsigned long, long, long*, long*) (validate.cpp:449)
> ==131765==    by 0x403736: main (main.cpp:381)
> ==131765==  Address 0xf0e0d0b9 is not stack'd, malloc'd or (recently) free'd
> ==131765==
>
>
> When I run the experiment with Open MPI it finishes without any error.
>
> Any pointers to fix this are highly appreciated.
>
> Thanks for help.
>
> Regards,
> Chaitra
>
>
>
> On Fri, Aug 8, 2014 at 12:42 AM, Chaitra Kumar <chaitragkumar at gmail.com> wrote:
>
>> Hi Hari,
>>
>> Thanks for your reply.  I will try to run the application with valgrind.
>> We have a requirement to use the tuned MPI version of Graph500, and
>> moreover this implementation works perfectly fine for many combinations of
>> scale and process count; it crashes only in some cases.
>>
>>
>> Based on the trace I got earlier, please let me know if the following is
>> a possible cause of the runtime error:
>>
>> *#0 0x00007f8ca65ca643 in free (mem=0x7f8bb5919010) at
>> src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485*
>> #1 0x000000000040915c in bitmap::clear (this=<value optimized out>) at
>> bitmap.cpp:54
>> #2 0x0000000000411cbe in run_bfs (root_raw=4685000, pred=0x7f8bbd9210a8,
>> settings=...) at bfs_custom.cpp:2036
>> #3 0x00000000004032ca in main (argc=2, argv=0x7fff5e9ce0d8) at
>> main.cpp:369
>>
>> -------------------------------------------------------------
>>
>> It is possible that when the bitmap object is created by the MPI
>> application (that is, the Graph500 code), the memory allocator used is the
>> default C++ allocator.  However, at the crash point the “free” that is
>> invoked belongs to the external memory allocator called “ptmalloc2”, based
>> on the captured trace information
>> (“src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485”).
>> Different memory allocators use different boundary guards (signatures).
>> When “free” is invoked, it finds that the guard is not what it would have
>> been had the corresponding ptmalloc2 “malloc” been invoked, declares memory
>> corruption, and the application crashes.
>>
>>
>>
>> In summary, this is likely a memory-allocator consistency problem: the MPI
>> application uses one memory allocator to “malloc” and the MPI runtime uses
>> a different memory allocator to “free”.  *So we need to find out why the
>> MPI runtime ends up freeing memory that was allocated by the MPI
>> application.  Should the data be copied across the MPI application/MPI
>> runtime boundary?*
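>>
>> (As a quick check of which allocator the binary actually binds to at run
>> time, here is a minimal diagnostic sketch. It assumes a glibc-based system
>> and uses the install path and binary name from earlier in this thread;
>> adjust as needed:)
>>
>> # Does the MVAPICH2 library itself export malloc/free (i.e. the embedded ptmalloc2)?
>> nm -D /home/padmanac/mvapich2/lib/libmpich.so.12 | grep -wE 'malloc|free'
>>
>> # Which shared libraries does the application actually load?
>> ldd ./graph500_mpi_custom_50
>>
>> # Optionally, glibc's loader can log where each malloc/free symbol binds:
>> # LD_DEBUG=bindings LD_DEBUG_OUTPUT=/tmp/bind <launch command>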
>>
>> The typical way to have the entire application stack plus the middleware
>> runtime use a common memory allocator is to package the memory allocator
>> separately as a shared library and have the application dynamically link
>> to it.  Further, in the shell environment of the user that invokes the
>> application, export “LD_LIBRARY_PATH” to include the path where this
>> memory allocator's shared library is located, so that the loader loads the
>> external allocator and the function entry points malloc() and free()
>> resolve to the ones it provides.  Such an approach is described by
>> Google's TCMalloc:
>> http://goog-perftools.sourceforge.net/doc/tcmalloc.html.
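>>
>> (For illustration, a sketch of that approach using TCMalloc; the install
>> prefix /opt/gperftools below is a hypothetical path, not something present
>> on this cluster:)
>>
>> # 1. Link the application against the external allocator, e.g. by appending
>> #    -L/opt/gperftools/lib -ltcmalloc to the existing Graph500 link flags.
>> # 2. Let the loader find it at run time so malloc()/free() resolve to it:
>> export LD_LIBRARY_PATH=/opt/gperftools/lib:$LD_LIBRARY_PATH
>> # 3. Or interpose it onto an unmodified binary instead:
>> export LD_PRELOAD=/opt/gperftools/lib/libtcmalloc.so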
>>
>>
>>
>> But from the tracing information it seems that MVAPICH actually
>> incorporates ptmalloc2 at the source-code level, rather than via dynamic
>> shared-library loading.
>>
>> If this is the cause, how can it be fixed?  Please let me know.
>>
>> Regards,
>> Chaitra
>>
>>
>>
>>
>> On Thu, Aug 7, 2014 at 6:58 PM, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>
>>> Hello Chaitra,
>>>
>>> From the backtrace, it looks to be some memory corruption or an
>>> out-of-memory condition. I do not think it is related to the MPI library.
>>> Can you try running the application with valgrind or some other memory
>>> checker to see if there is a memory overrun / leak in the Graph500 code?
>>>
>>> In the past, the following version of Graph500 worked fine for us. Would
>>> it be possible for you to try this out as well?
>>>
>>> http://www.graph500.org/sites/default/files/files/graph500-1.2.tar.bz2
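>>>
>>> A sketch of how to try it (this assumes the tarball's mpi/ directory
>>> builds with its bundled Makefile and produces the graph500_mpi_simple
>>> binary; adjust to your environment):
>>>
>>> wget http://www.graph500.org/sites/default/files/files/graph500-1.2.tar.bz2
>>> tar xjf graph500-1.2.tar.bz2
>>> cd graph500-1.2/mpi && make
>>> mpiexec -f hostfile -np 50 ./graph500_mpi_simple 29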
>>>
>>> Regards,
>>> Hari.
>>>
>>>
>>> On Thu, Aug 7, 2014 at 7:19 AM, Chaitra Kumar <chaitragkumar at gmail.com> wrote:
>>>
>>>> Hi Hari,
>>>>
>>>> I had earlier compiled the code with gcc 4.8.2.  Today I recompiled it
>>>> with gcc 4.4.7 and tried running Graph500.
>>>>
>>>> The configuration I used was:
>>>>
>>>> ./configure --prefix=/home/padmanac/mvapich2 --enable-cxx
>>>> --enable-threads=multiple --with-device=ch3:mrail --with-rdma=gen2
>>>> --disable-fast --enable-g=all --enable-error-messages=all
>>>>
>>>> The latest stacktrace is as below:
>>>> *#0 0x00007f8ca65ca643 in free (mem=0x7f8bb5919010) at
>>>> src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485*
>>>> #1 0x000000000040915c in bitmap::clear (this=<value optimized out>) at
>>>> bitmap.cpp:54
>>>> #2 0x0000000000411cbe in run_bfs (root_raw=4685000,
>>>> pred=0x7f8bbd9210a8, settings=...) at bfs_custom.cpp:2036
>>>> #3 0x00000000004032ca in main (argc=2, argv=0x7fff5e9ce0d8) at
>>>> main.cpp:369
>>>>
>>>> When I rebuilt MVAPICH2 with '--disable-registration-cache', the trace
>>>> was:
>>>>
>>>> #0  0x000000321f032925 in raise () from /lib64/libc.so.6
>>>> #1  0x000000321f034105 in abort () from /lib64/libc.so.6
>>>> #2  0x000000321f070837 in __libc_message () from /lib64/libc.so.6
>>>> #3  0x000000321f076166 in malloc_printerr () from /lib64/libc.so.6
>>>> #4  0x000000000040915c in bitmap::clear (this=<value optimized out>) at bitmap.cpp:54
>>>> #5  0x0000000000411c64 in run_bfs (root_raw=4795152, pred=0x7fdfb570f0a8, settings=...) at bfs_custom.cpp:2032
>>>> #6  0x00000000004032ca in main (argc=2, argv=0x7fff70f0c498) at main.cpp:369
>>>>
>>>> Please let me know if you need more information.
>>>>
>>>> Regards,
>>>> Chaitra
>>>>
>>>> On Wed, Aug 6, 2014 at 11:33 PM, Chaitra Kumar <chaitragkumar at gmail.com> wrote:
>>>>
>>>>> Hi Hari,
>>>>>
>>>>> I followed the steps specified by you.
>>>>>
>>>>> Still execution fails. The new trace is below:
>>>>>
>>>>> (gdb) bt
>>>>> #0  0x0000003ea8a32925 in raise () from /lib64/libc.so.6
>>>>> #1  0x0000003ea8a34105 in abort () from /lib64/libc.so.6
>>>>> #2  0x0000003ea8a70837 in __libc_message () from /lib64/libc.so.6
>>>>> #3  0x0000003ea8a76166 in malloc_printerr () from /lib64/libc.so.6
>>>>> #4  0x00000000004054e5 in xfree ()
>>>>> #5  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>> #6  0x0000000000415c6e in run_bfs(long, long*, bfs_settings const&) ()
>>>>>     at bfs_custom.cpp:2032
>>>>> #7  0x0000000000403a5d in main () at main.cpp:370
>>>>>
>>>>>
>>>>> Regards,
>>>>> Chaitra
>>>>>
>>>>>
>>>>> On Wed, Aug 6, 2014 at 6:47 PM, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>>>>
>>>>>> Hello Chaitra,
>>>>>>
>>>>>> Can you try rebuilding mvapich2 with the
>>>>>> "--disable-registration-cache" configure flag?
>>>>>>
>>>>>> ./configure --disable-registration-cache <other options>; make clean;
>>>>>> make -j 4; make install
>>>>>>
>>>>>> Once you've done this, please recompile your application and give it
>>>>>> a shot.
>>>>>>
>>>>>> Regards,
>>>>>> Hari.
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 6, 2014 at 1:47 AM, Chaitra Kumar <chaitragkumar at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> Even with the setting "-env MV2_USE_LAZY_MEM_UNREGISTER=0", there
>>>>>>> is no change in the error or trace.
>>>>>>>
>>>>>>> Pasting the backtrace again:
>>>>>>> (gdb) bt
>>>>>>> #0  0x00007f5cdd8012dc in _int_free ()
>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>> #1  0x00007f5cdd7ffa96 in free ()
>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>> #2  0x00000000004054e5 in xfree ()
>>>>>>> #3  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>>> #4  0x0000000000415cda in run_bfs(long, long*, bfs_settings const&)
>>>>>>> ()
>>>>>>>     at bfs_custom.cpp:2036
>>>>>>> #5  0x0000000000403a5d in main () at main.cpp:370
>>>>>>>
>>>>>>> Any help is highly appreciated.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Chaitra
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 6, 2014 at 1:17 AM, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>>>>>>
>>>>>>>> Hi Chaitra,
>>>>>>>>
>>>>>>>> Can you try running after setting "-env
>>>>>>>> MV2_USE_LAZY_MEM_UNREGISTER=0"?
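>>>>>>>>
>>>>>>>> For example (a sketch based on the launch line quoted below, with the
>>>>>>>> flag added):
>>>>>>>>
>>>>>>>> mpiexec -f hostfile -np 50 -env MV2_USE_LAZY_MEM_UNREGISTER=0 -env MV2_ENABLE_AFFINITY=0 ./graph500_mpi_custom_50 29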
>>>>>>>>
>>>>>>>> I'm cc'ing this note to our internal developer list. I would
>>>>>>>> appreciate it if you could respond to this e-mail so that we can give
>>>>>>>> feedback.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Hari.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 5, 2014 at 3:13 PM, Chaitra Kumar <chaitragkumar at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Hari,
>>>>>>>>>
>>>>>>>>> Thanks for the quick reply.
>>>>>>>>> I am using the tuned MPI implementation available on the Graph500
>>>>>>>>> site (http://www.graph500.org/referencecode).  I haven't modified
>>>>>>>>> this code.
>>>>>>>>>
>>>>>>>>> Only some experiments throw a segmentation fault; other experiments
>>>>>>>>> complete without any errors. For example:
>>>>>>>>>
>>>>>>>>> The following command fails with segmentation fault:
>>>>>>>>>
>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited -env
>>>>>>>>> MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>> ./graph500_mpi_custom_50 *29*
>>>>>>>>>
>>>>>>>>> whereas if I change the scale to 28 or 30 the same code works
>>>>>>>>> without any error:
>>>>>>>>>
>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited -env
>>>>>>>>> MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>> ./graph500_mpi_custom_50 *30* .
>>>>>>>>>
>>>>>>>>> Please let me know if you want me to run with some other options.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Chaitra
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 6, 2014 at 12:20 AM, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Chaitra,
>>>>>>>>>>
>>>>>>>>>> From the backtrace it seems that the failure (possibly a double
>>>>>>>>>> free) is happening in the application code. You mentioned that the
>>>>>>>>>> Graph500 is a tuned version. Does this mean that you have made local
>>>>>>>>>> code changes to it? If so, could you try using an unmodified version
>>>>>>>>>> of Graph500 and see if the same failure happens?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Hari.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 5, 2014 at 1:57 PM, Chaitra Kumar <chaitragkumar at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Team,
>>>>>>>>>>>
>>>>>>>>>>> I am trying to run Graph500 (tuned MPI version) on MVAPICH2-2.0.
>>>>>>>>>>> The machine has InfiniBand.
>>>>>>>>>>>
>>>>>>>>>>> I am using the following configuration to build MVAPICH2 (I have
>>>>>>>>>>> enabled debugging options):
>>>>>>>>>>>
>>>>>>>>>>> ./configure --prefix=/home/padmanac/mvapich2 --enable-cxx
>>>>>>>>>>> --enable-threads=multiple --with-device=ch3:mrail --with-rdma=gen2
>>>>>>>>>>> --disable-fast --enable-g=all --enable-error-messages=all
>>>>>>>>>>>
>>>>>>>>>>> The command which I am using to launch Graph500 is:
>>>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited
>>>>>>>>>>> -env MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>>>> ./graph500_mpi_custom_50 29
>>>>>>>>>>>
>>>>>>>>>>> This command always results in a segmentation fault. From the
>>>>>>>>>>> core dump I have got the backtrace.
>>>>>>>>>>>
>>>>>>>>>>> Please find the trace below:
>>>>>>>>>>>
>>>>>>>>>>> (gdb) bt
>>>>>>>>>>> #0  0x00007f38f84b52dc in _int_free ()
>>>>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>>>>> #1  0x00007f38f84b3a96 in free ()
>>>>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>>>>> #2  0x00000000004054e5 in xfree ()
>>>>>>>>>>> #3  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>>>>>>> #4  0x0000000000415cda in run_bfs(long, long*, bfs_settings
>>>>>>>>>>> const&) ()
>>>>>>>>>>>     at bfs_custom.cpp:2036
>>>>>>>>>>> #5  0x0000000000403a5d in main () at main.cpp:370
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please let me know how to solve this. Am I missing some
>>>>>>>>>>> configuration flag?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Chaitra
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> mvapich-discuss mailing list
>>>>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>>>>>
>>>>>>>>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>