[mvapich-discuss] Segmentation fault while running application

Akshay Venkatesh akshay.v.3.14 at gmail.com
Thu Aug 21 19:18:08 EDT 2014


Chaitra,

We wanted to confirm whether the patch sent to you off-list (attached) fixed
the free() error you noticed with graph500. We plan to include the bug fix in
the next release. When time permits, let us know whether the new experiments
with the patch were successful.
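
In case it helps, a typical way to apply it is roughly (a sketch, assuming the
patch was generated against the top of the MVAPICH2-2.0 source tree; adjust
the -p level and paths as needed):

cd mvapich2-2.0
patch -p1 < shm_cleanup.patch
make -j 4 && make install

and then re-run the experiments that showed the error.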



On Mon, Aug 18, 2014 at 5:08 AM, Chaitra Kumar <chaitragkumar at gmail.com>
wrote:

> Hi Hari and Team,
>
> To rule out any issues with the existing libraries, we reimaged the OS and
> re-installed all RDMA-related drivers.  Still the problem persists.
>
> I also ran the application using valgrind.
>
> The command I used was:
>
> mpirun_rsh -np 72 -hostfile hostfile  MV2_DEBUG_CORESIZE=unlimited
> MV2_DEBUG_SHOW_BACKTRACE=1  MV2_ENABLE_AFFINITY=0 valgrind --tool=memcheck
> --leak-check=full --track-origins=yes --show-reachable=yes
> ./graph500_mpi_custom_72 28
>
> It generated 72 core files.
>
>
> Below is the backtrace generated for one of the processes:
>
> #0  0x000000321f80f5db in raise () from /lib64/libpthread.so.0
> #1  <signal handler called>
> #2  0x000000000574b6ea in MPL_trfree ()
>    from /home/padmanac/mvapich-gdb/lib/libmpl.so.1
> #3  0x0000000004ed747c in MPIU_trfree (a_ptr=0xf0e0d0c9, line=4139,
>     fname=0x5292bdc "src/mpi/coll/ch3_shmem_coll.c") at
> src/util/mem/trmem.c:37
> #4  0x0000000005032868 in mv2_shm_coll_cleanup (shmem=0x1621cc28)
>     at src/mpi/coll/ch3_shmem_coll.c:4139
> #5  0x000000000517729e in free_2level_comm (comm_ptr=0x10697c18)
>     at src/mpi/comm/create_2level_comm.c:144
> #6  0x0000000004eca438 in MPIR_Comm_delete_internal (comm_ptr=0x10697c18,
>     isDisconnect=0) at src/mpi/comm/commutil.c:1918
> #7  0x000000000516a68e in MPIR_Comm_release (comm_ptr=0x10697c18,
>     isDisconnect=0) at ./src/include/mpiimpl.h:1331
> #8  0x000000000516a9f7 in PMPI_Comm_free (comm=0x7feffebb4)
>     at src/mpi/comm/comm_free.c:124
> #9  0x0000000000408f8d in scatter_bitmap_set::~scatter_bitmap_set (
>     this=0x7feffeba0, __in_chrg=<value optimized out>) at onesided.hpp:271
> #10 0x0000000000406b7c in validate_bfs_result (tg=0x7fefff1a0,
>     nglobalverts=268435456, nlocalverts=4194304, root=31958113,
>     pred=0xa73a10a8, edge_visit_count_ptr=0x7fefff118) at validate.cpp:449
> #11 0x0000000000403737 in main (argc=2, argv=0x7fefff498) at main.cpp:381
>
> Valgrind generated error log:
> [polaris-1:mpi_rank_36][error_sighandler] Caught error: Segmentation fault (signal 11)
> ==131765== Invalid read of size 8
> ==131765==    at 0x574B6EA: MPL_trfree (in /home/padmanac/mvapich-gdb/lib/libmpl.so.1.0.0)
> ==131765==    by 0x4ED747B: MPIU_trfree (trmem.c:37)
> ==131765==    by 0x5032867: mv2_shm_coll_cleanup (ch3_shmem_coll.c:4139)
> ==131765==    by 0x517729D: free_2level_comm (create_2level_comm.c:144)
> ==131765==    by 0x4ECA437: MPIR_Comm_delete_internal (commutil.c:1918)
> ==131765==    by 0x516A68D: MPIR_Comm_release.clone.0 (mpiimpl.h:1331)
> ==131765==    by 0x516A9F6: PMPI_Comm_free (comm_free.c:124)
> ==131765==    by 0x408F8C: scatter_bitmap_set::~scatter_bitmap_set() (onesided.hpp:271)
> ==131765==    by 0x406B7B: validate_bfs_result(tuple_graph const*, long, unsigned long, long, long*, long*) (validate.cpp:449)
> ==131765==    by 0x403736: main (main.cpp:381)
> ==131765==  Address 0xf0e0d0b9 is not stack'd, malloc'd or (recently) free'd
> ==131765==
>
>
> When I run the experiment with Open MPI, it finishes without any error.
>
> Any pointers to fix this would be highly appreciated.
>
> Thanks for the help.
>
> Regards,
> Chaitra
>
>
>
> On Fri, Aug 8, 2014 at 12:42 AM, Chaitra Kumar <chaitragkumar at gmail.com>
> wrote:
>
>> Hi Hari,
>>
>> Thanks for your reply.  I will try to run the application with valgrind.
>> We have a requirement to use the tuned MPI version of Graph500. Moreover,
>> this implementation works perfectly fine for many combinations of scale
>> and process count; it crashes only in some cases.
>>
>>
>> Based on the trace I got earlier, please let me know whether the following
>> is a possible cause of the runtime error:
>>
>> #0 0x00007f8ca65ca643 in free (mem=0x7f8bb5919010) at
>> src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485
>> #1 0x000000000040915c in bitmap::clear (this=<value optimized out>) at
>> bitmap.cpp:54
>> #2 0x0000000000411cbe in run_bfs (root_raw=4685000, pred=0x7f8bbd9210a8,
>> settings=...) at bfs_custom.cpp:2036
>> #3 0x00000000004032ca in main (argc=2, argv=0x7fff5e9ce0d8) at
>> main.cpp:369
>>
>> -------------------------------------------------------------
>>
>> It is possible that when the bitmap object is created by the MPI
>> application (that is, the Graph 500 code), the memory is allocated by the
>> default C++/glibc allocator.  However, at the crash point the "free" being
>> invoked comes from the external memory allocator "ptmalloc2", as shown by
>> the trace entry
>> "src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485".
>> Different memory allocators use different boundary guards (signatures).
>> When this "free" is invoked on memory that was not obtained from the
>> corresponding ptmalloc2 "malloc", it finds that the guard is not what it
>> expects, declares memory corruption, and the application crashes.
>>
>>
>>
>> In summary, this is likely a memory allocator consistency problem: the MPI
>> application uses one memory allocator to "malloc" and the MPI runtime uses
>> a different memory allocator to "free".  So we need to find out why the MPI
>> runtime ends up freeing memory that was allocated by the MPI application.
>> Should the data be copied across the MPI application/MPI runtime boundary?
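>>
>> One way to check which allocator actually serves the application's
>> malloc/free at run time (illustrative commands; the library path is taken
>> from the earlier traces) is to see whether the MPI shared library exports
>> its own malloc/free symbols, which would then intercept the application's
>> calls:
>>
>> ldd ./graph500_mpi_custom_50 | grep mpich
>> nm -D /home/padmanac/mvapich2/lib/libmpich.so.12 | grep -w -e malloc -e free
>>
>> If malloc and free show up as defined (T) symbols there, the application's
>> allocations are already being served by ptmalloc2, so malloc and free would
>> at least be paired consistently.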
>>
>> The typical way to have the entire application stack plus the middleware
>> runtime use a common memory allocator is to package the allocator
>> separately as a shared library and have the application link to it
>> dynamically.  Further, in the shell environment of the user who invokes the
>> application, export "LD_LIBRARY_PATH" to include the path where this
>> allocator's shared library is located, so that the loader loads the
>> external allocator and resolves the function entry points malloc() and
>> free() to the ones it provides.  Such an approach is described for Google's
>> TCMalloc:
>> http://goog-perftools.sourceforge.net/doc/tcmalloc.html.
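>>
>> A concrete sketch of that approach with tcmalloc (the library path below is
>> an example only and depends on the installation): export, on every node,
>> before the job is launched,
>>
>> export LD_PRELOAD=/usr/lib64/libtcmalloc.so
>>
>> so that malloc()/free() in both the application and the dynamically linked
>> runtime resolve to the same allocator. Linking the application with
>> -ltcmalloc at build time achieves the same effect.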
>>
>>
>>
>> But from the tracing information it seems that MVAPICH actually
>> incorporates ptmalloc2 at the source-code level, rather than via dynamic
>> shared-library loading.
>>
>>
>> If this is indeed the cause, how can it be fixed?  Please let me know.
>>
>> Regards,
>> Chaitra
>>
>>
>>
>>
>> On Thu, Aug 7, 2014 at 6:58 PM, Hari Subramoni <subramoni.1 at osu.edu>
>> wrote:
>>
>>> Hello Chaitra,
>>>
>>> From the backtrace, it looks like some memory corruption or an
>>> out-of-memory condition. I do not think it is related to the MPI library.
>>> Can you try running the application with valgrind or some other memory
>>> checker to see if there is a memory overrun / leak in the Graph500 code?
>>>
>>> In the past, the following version of Graph500 worked fine for us. Would
>>> it be possible for you to try this out as well?
>>>
>>> http://www.graph500.org/sites/default/files/files/graph500-1.2.tar.bz2
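>>>
>>> If it helps, the usual steps would be roughly (a sketch; the make step
>>> assumes the reference tarball's standard layout with an mpi/ subdirectory
>>> and that the build settings match your MPI compilers):
>>>
>>> wget http://www.graph500.org/sites/default/files/files/graph500-1.2.tar.bz2
>>> tar xjf graph500-1.2.tar.bz2
>>> cd graph500-1.2/mpi && make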
>>>
>>> Regards,
>>> Hari.
>>>
>>>
>>> On Thu, Aug 7, 2014 at 7:19 AM, Chaitra Kumar <chaitragkumar at gmail.com>
>>> wrote:
>>>
>>>> Hi Hari,
>>>>
>>>> I had earlier compiled the code with gcc 4.8.2.  Today I recompiled it
>>>> with gcc 4.4.7 and tried running Graph500.
>>>>
>>>> The configuration I used was:
>>>>
>>>> ./configure --prefix=/home/padmanac/mvapich2 --enable-cxx
>>>> --enable-threads=multiple --with-device=ch3:mrail --with-rdma=gen2
>>>> --disable-fast --enable-g=all --enable-error-messages=all
>>>>
>>>> The latest stack trace is below:
>>>> #0 0x00007f8ca65ca643 in free (mem=0x7f8bb5919010) at
>>>> src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485
>>>> #1 0x000000000040915c in bitmap::clear (this=<value optimized out>) at
>>>> bitmap.cpp:54
>>>> #2 0x0000000000411cbe in run_bfs (root_raw=4685000,
>>>> pred=0x7f8bbd9210a8, settings=...) at bfs_custom.cpp:2036
>>>> #3 0x00000000004032ca in main (argc=2, argv=0x7fff5e9ce0d8) at
>>>> main.cpp:369
>>>>
>>>> When I rebuilt MVAPICH2 with '--disable-registration-cache', the trace was:
>>>>
>>>> #0  0x000000321f032925 in raise () from /lib64/libc.so.6
>>>> #1  0x000000321f034105 in abort () from /lib64/libc.so.6
>>>> #2  0x000000321f070837 in __libc_message () from /lib64/libc.so.6
>>>> #3  0x000000321f076166 in malloc_printerr () from /lib64/libc.so.6
>>>> #4  0x000000000040915c in bitmap::clear (this=<value optimized out>) at bitmap.cpp:54
>>>> #5  0x0000000000411c64 in run_bfs (root_raw=4795152, pred=0x7fdfb570f0a8, settings=...) at bfs_custom.cpp:2032
>>>> #6  0x00000000004032ca in main (argc=2, argv=0x7fff70f0c498) at main.cpp:369
>>>>
>>>> Please let me know if you need more information.
>>>>
>>>> Regards,
>>>> Chaitra
>>>>
>>>>
>>>> On Wed, Aug 6, 2014 at 11:33 PM, Chaitra Kumar <chaitragkumar at gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Hari,
>>>>>
>>>>> I followed the steps you specified.
>>>>>
>>>>> The execution still fails. The new trace is below:
>>>>>
>>>>> (gdb) bt
>>>>> #0  0x0000003ea8a32925 in raise () from /lib64/libc.so.6
>>>>> #1  0x0000003ea8a34105 in abort () from /lib64/libc.so.6
>>>>> #2  0x0000003ea8a70837 in __libc_message () from /lib64/libc.so.6
>>>>> #3  0x0000003ea8a76166 in malloc_printerr () from /lib64/libc.so.6
>>>>> #4  0x00000000004054e5 in xfree ()
>>>>> #5  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>> #6  0x0000000000415c6e in run_bfs(long, long*, bfs_settings const&) ()
>>>>>     at bfs_custom.cpp:2032
>>>>> #7  0x0000000000403a5d in main () at main.cpp:370
>>>>>
>>>>>
>>>>> Regards,
>>>>> Chaitra
>>>>>
>>>>>
>>>>> On Wed, Aug 6, 2014 at 6:47 PM, Hari Subramoni <subramoni.1 at osu.edu>
>>>>> wrote:
>>>>>
>>>>>> Hello Chaitra,
>>>>>>
>>>>>> Can you try rebuilding mvapich2 with the
>>>>>> "--disable-registration-cache" configure flag?
>>>>>>
>>>>>> ./configure --disable-registration-cache <other options>; make clean;
>>>>>> make -j 4; make install
>>>>>>
>>>>>> Once you've done this, please recompile your application and give it
>>>>>> a shot.
>>>>>>
>>>>>> Regards,
>>>>>> Hari.
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 6, 2014 at 1:47 AM, Chaitra Kumar <
>>>>>> chaitragkumar at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> Even with the setting "-env MV2_USE_LAZY_MEM_UNREGISTER=0", there
>>>>>>> is no change in the error or the trace.
>>>>>>>
>>>>>>> Pasting the backtrace again:
>>>>>>> (gdb) bt
>>>>>>> #0  0x00007f5cdd8012dc in _int_free ()
>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>> #1  0x00007f5cdd7ffa96 in free ()
>>>>>>>
>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>> #2  0x00000000004054e5 in xfree ()
>>>>>>> #3  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>>> #4  0x0000000000415cda in run_bfs(long, long*, bfs_settings const&)
>>>>>>> ()
>>>>>>>     at bfs_custom.cpp:2036
>>>>>>> #5  0x0000000000403a5d in main () at main.cpp:370
>>>>>>>
>>>>>>> Any help is highly appreciated.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Chaitra
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 6, 2014 at 1:17 AM, Hari Subramoni <subramoni.1 at osu.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Chaitra,
>>>>>>>>
>>>>>>>> Can you try running after setting "-env
>>>>>>>> MV2_USE_LAZY_MEM_UNREGISTER=0"?
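>>>>>>>>
>>>>>>>> For example, added to the launch command from your earlier mail
>>>>>>>> (illustrative; keep your other options as they are):
>>>>>>>>
>>>>>>>> mpiexec -f hostfile -np 50 -env MV2_USE_LAZY_MEM_UNREGISTER=0 -env
>>>>>>>> MV2_ENABLE_AFFINITY=0 ./graph500_mpi_custom_50 29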
>>>>>>>>
>>>>>>>> I'm cc'ing this note to our internal developer list. I would
>>>>>>>> appreciate it if you could respond to this e-mail so that we can give
>>>>>>>> feedback.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Hari.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 5, 2014 at 3:13 PM, Chaitra Kumar <
>>>>>>>> chaitragkumar at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Hari,
>>>>>>>>>
>>>>>>>>> Thanks for the quick reply.
>>>>>>>>> I am using the tuned MPI implementation available on the Graph500 site (
>>>>>>>>> http://www.graph500.org/referencecode).  I haven't modified this
>>>>>>>>> code.
>>>>>>>>>
>>>>>>>>> Only some experiments throw a segmentation fault; the other
>>>>>>>>> experiments complete without any errors. For example:
>>>>>>>>>
>>>>>>>>> The following command fails with a segmentation fault:
>>>>>>>>>
>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited -env
>>>>>>>>> MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>> ./graph500_mpi_custom_50 29
>>>>>>>>>
>>>>>>>>> whereas if I change the scale to 28 or 30, the same code works
>>>>>>>>> without any error:
>>>>>>>>>
>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited -env
>>>>>>>>> MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>> ./graph500_mpi_custom_50 30
>>>>>>>>>
>>>>>>>>> Please let me know if you want me to run with some other options.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Chaitra
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 6, 2014 at 12:20 AM, Hari Subramoni <
>>>>>>>>> subramoni.1 at osu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Chaitra,
>>>>>>>>>>
>>>>>>>>>> From the backtrace it seems that the failure (possibly a double
>>>>>>>>>> free) is happening in the application code. You mentioned that the Graph500
>>>>>>>>>> is a tuned version. Does this mean that you have made local code changes to
>>>>>>>>>> it? If so, could you try using an unmodified version of Graph500 and
>>>>>>>>>> see if the same failure happens?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Hari.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 5, 2014 at 1:57 PM, Chaitra Kumar <
>>>>>>>>>> chaitragkumar at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Team,
>>>>>>>>>>>
>>>>>>>>>>> I am trying to run Graph500 (tuned MPI version) on MVAPICH2-2.0.
>>>>>>>>>>> The machine has InfiniBand.
>>>>>>>>>>>
>>>>>>>>>>> I am using the following configuration to build MVAPICH2 (I have
>>>>>>>>>>> enabled the debugging options):
>>>>>>>>>>>
>>>>>>>>>>> ./configure --prefix=/home/padmanac/mvapich2 --enable-cxx
>>>>>>>>>>> --enable-threads=multiple --with-device=ch3:mrail --with-rdma=gen2
>>>>>>>>>>> --disable-fast --enable-g=all --enable-error-messages=all
>>>>>>>>>>>
>>>>>>>>>>> The command I am using to launch Graph500 is:
>>>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited
>>>>>>>>>>> -env MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>>>> ./graph500_mpi_custom_50 29
>>>>>>>>>>>
>>>>>>>>>>> This command always results in a segmentation fault. From the
>>>>>>>>>>> core dump I have obtained the backtrace.
>>>>>>>>>>>
>>>>>>>>>>> Please find the trace below:
>>>>>>>>>>>
>>>>>>>>>>> (gdb) bt
>>>>>>>>>>> #0  0x00007f38f84b52dc in _int_free ()
>>>>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>>>>> #1  0x00007f38f84b3a96 in free ()
>>>>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>>>>> #2  0x00000000004054e5 in xfree ()
>>>>>>>>>>> #3  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>>>>>>> #4  0x0000000000415cda in run_bfs(long, long*, bfs_settings
>>>>>>>>>>> const&) ()
>>>>>>>>>>>     at bfs_custom.cpp:2036
>>>>>>>>>>> #5  0x0000000000403a5d in main () at main.cpp:370
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please let me know how to solve this. Am I missing some
>>>>>>>>>>> configuration flag?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Chaitra
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
-Akshay
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shm_cleanup.patch
Type: text/x-patch
Size: 1213 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20140821/a47f1003/attachment-0001.bin>

