[mvapich-discuss] Segmentation fault while running application

Chaitra Kumar chaitragkumar at gmail.com
Mon Aug 25 04:39:59 EDT 2014


Hi Akshay,

With the patch I haven't gotten the error. I am able to run many experiments.

Thanks for your help.

Regards,
Chaitra


On Fri, Aug 22, 2014 at 4:48 AM, Akshay Venkatesh <akshay.v.3.14 at gmail.com>
wrote:

> Chaitra,
>
> We wanted to confirm whether the patch given to you off-list (attached) fixed
> the free error you noticed with Graph500. We plan to include the bug fix in
> the next release. When time permits, let us know whether the new experiments
> with the patch were successful.
>
>
>
> On Mon, Aug 18, 2014 at 5:08 AM, Chaitra Kumar <chaitragkumar at gmail.com>
> wrote:
>
>> Hi Hari and Team,
>>
>> To rule out any issues with the existing libraries, we reimaged the OS and
>> re-installed all RDMA-related drivers. The problem still persists.
>>
>> I also ran the application using valgrind.
>>
>> The command I used was:
>>
>> mpirun_rsh -np 72 -hostfile hostfile  MV2_DEBUG_CORESIZE=unlimited
>> MV2_DEBUG_SHOW_BACKTRACE=1  MV2_ENABLE_AFFINITY=0 valgrind --tool=memcheck
>> --leak-check=full --track-origins=yes --show-reachable=yes
>> ./graph500_mpi_custom_72 28
>>
>> It generated 72 core files.
>>
>>
>> Below is the backtrace generated for one of the processes:
>>
>> #0  0x000000321f80f5db in raise () from /lib64/libpthread.so.0
>> #1  <signal handler called>
>> #2  0x000000000574b6ea in MPL_trfree ()
>>    from /home/padmanac/mvapich-gdb/lib/libmpl.so.1
>> #3  0x0000000004ed747c in MPIU_trfree (a_ptr=0xf0e0d0c9, line=4139,
>>     fname=0x5292bdc "src/mpi/coll/ch3_shmem_coll.c") at
>> src/util/mem/trmem.c:37
>> #4  0x0000000005032868 in mv2_shm_coll_cleanup (shmem=0x1621cc28)
>>     at src/mpi/coll/ch3_shmem_coll.c:4139
>> #5  0x000000000517729e in free_2level_comm (comm_ptr=0x10697c18)
>>     at src/mpi/comm/create_2level_comm.c:144
>> #6  0x0000000004eca438 in MPIR_Comm_delete_internal (comm_ptr=0x10697c18,
>>     isDisconnect=0) at src/mpi/comm/commutil.c:1918
>> #7  0x000000000516a68e in MPIR_Comm_release (comm_ptr=0x10697c18,
>>     isDisconnect=0) at ./src/include/mpiimpl.h:1331
>> #8  0x000000000516a9f7 in PMPI_Comm_free (comm=0x7feffebb4)
>>     at src/mpi/comm/comm_free.c:124
>> #9  0x0000000000408f8d in scatter_bitmap_set::~scatter_bitmap_set (
>>     this=0x7feffeba0, __in_chrg=<value optimized out>) at onesided.hpp:271
>> #10 0x0000000000406b7c in validate_bfs_result (tg=0x7fefff1a0,
>>     nglobalverts=268435456, nlocalverts=4194304, root=31958113,
>>     pred=0xa73a10a8, edge_visit_count_ptr=0x7fefff118) at validate.cpp:449
>> #11 0x0000000000403737 in main (argc=2, argv=0x7fefff498) at main.cpp:381
>>
>> Valgrind generated error log:
>> [polaris-1:mpi_rank_36][error_sighandler] Caught error: Segmentation fault (signal 11)
>> ==131765== Invalid read of size 8
>> ==131765==    at 0x574B6EA: MPL_trfree (in /home/padmanac/mvapich-gdb/lib/libmpl.so.1.0.0)
>> ==131765==    by 0x4ED747B: MPIU_trfree (trmem.c:37)
>> ==131765==    by 0x5032867: mv2_shm_coll_cleanup (ch3_shmem_coll.c:4139)
>> ==131765==    by 0x517729D: free_2level_comm (create_2level_comm.c:144)
>> ==131765==    by 0x4ECA437: MPIR_Comm_delete_internal (commutil.c:1918)
>> ==131765==    by 0x516A68D: MPIR_Comm_release.clone.0 (mpiimpl.h:1331)
>> ==131765==    by 0x516A9F6: PMPI_Comm_free (comm_free.c:124)
>> ==131765==    by 0x408F8C: scatter_bitmap_set::~scatter_bitmap_set() (onesided.hpp:271)
>> ==131765==    by 0x406B7B: validate_bfs_result(tuple_graph const*, long, unsigned long, long, long*, long*) (validate.cpp:449)
>> ==131765==    by 0x403736: main (main.cpp:381)
>> ==131765==  Address 0xf0e0d0b9 is not stack'd, malloc'd or (recently) free'd
>> ==131765==
>>
>>
>> When I run the experiment with Open MPI, it finishes without any error.
>>
>> Any pointers on how to fix this are highly appreciated.
>>
>> Thanks for your help.
>>
>> Regards,
>> Chaitra
>>
>>
>>
>> On Fri, Aug 8, 2014 at 12:42 AM, Chaitra Kumar <chaitragkumar at gmail.com>
>> wrote:
>>
>>> Hi Hari,
>>>
>>> Thanks for your reply. I will try to run the application with valgrind.
>>> We have a requirement to use the tuned MPI version of Graph500; moreover,
>>> this implementation works perfectly fine for many combinations of scale
>>> and process count, and it crashes only in some cases.
>>>
>>>
>>> Based on the trace I got earlier, please let me know whether the following
>>> is a possible cause of the runtime error:
>>>
>>> #0 0x00007f8ca65ca643 in free (mem=0x7f8bb5919010) at
>>> src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485
>>> #1 0x000000000040915c in bitmap::clear (this=<value optimized out>) at
>>> bitmap.cpp:54
>>> #2 0x0000000000411cbe in run_bfs (root_raw=4685000, pred=0x7f8bbd9210a8,
>>> settings=...) at bfs_custom.cpp:2036
>>> #3 0x00000000004032ca in main (argc=2, argv=0x7fff5e9ce0d8) at
>>> main.cpp:369
>>>
>>> -------------------------------------------------------------
>>>
>>> It is possible that when the bitmap object is created by the MPI
>>> application (that is, Graph500), the memory allocator used is the default
>>> C++ allocator. However, at the crash point, the "free" that is invoked
>>> comes from the external memory allocator "ptmalloc2", based on the
>>> captured trace information
>>> ("src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485").
>>> Different memory allocators use different boundary guards (signatures).
>>> When this "free" is invoked, it finds that the guard is not what it would
>>> have been had the corresponding ptmalloc2 "malloc" been used, declares
>>> memory corruption, and the application crashes.
>>>
>>>
>>>
>>> In summary, this is likely a memory-allocator consistency problem: the MPI
>>> application uses one memory allocator to "malloc" and the MPI runtime uses
>>> a different memory allocator to "free". So we need to find out why the MPI
>>> runtime ends up freeing memory that was allocated by the MPI application.
>>> Should the data be copied across the MPI application / MPI runtime
>>> boundary?
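>>>
>>> To make the suspected failure mode concrete, here is a minimal,
>>> hypothetical sketch (not the actual Graph500 or MVAPICH2 code) of what
>>> happens when memory is released by an allocator other than the one that
>>> produced it:
>>>
>>>   #include <cstdlib>
>>>
>>>   int main() {
>>>       // Application side: a buffer from the default C++ allocator
>>>       // (operator new[]), standing in for the bitmap/pred arrays.
>>>       long *buf = new long[1 << 20];
>>>
>>>       // "Runtime" side: the buffer is handed to a free() that belongs to
>>>       // a different allocator. That free() looks for its own chunk
>>>       // metadata (boundary guards) in front of the pointer, does not
>>>       // find what it expects, and declares memory corruption or simply
>>>       // crashes, matching the signature in the backtraces above.
>>>       std::free(buf);
>>>
>>>       return 0;
>>>   }
>>>
>>> In our case the two sides would be the default C++/glibc allocator in the
>>> application and ptmalloc2 inside MVAPICH2.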
>>>
>>> The typical way to have the entire application stack plus the middleware
>>> runtime use a common memory allocator is to package the memory allocator
>>> separately as a shared library and then have the application dynamically
>>> link against that library. Further, in the shell environment of the user
>>> that invokes the application, export LD_LIBRARY_PATH to include the path
>>> where this memory allocator's shared library is located, so that the
>>> loader will load this external allocator and the function entry points
>>> malloc() and free() resolve to the ones it provides. Such an approach is
>>> described by Google's TCMalloc:
>>> http://goog-perftools.sourceforge.net/doc/tcmalloc.html.
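>>>
>>> For illustration only (the gperftools install path below is a made-up
>>> example, and it assumes the application binary has been linked with
>>> -ltcmalloc as the TCMalloc documentation describes), the run would then
>>> look something like:
>>>
>>>   export LD_LIBRARY_PATH=/opt/gperftools/lib:$LD_LIBRARY_PATH
>>>   mpirun_rsh -np 72 -hostfile hostfile ./graph500_mpi_custom_72 28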
>>>
>>>
>>>
>>> But from the tracing information, it seems that MVAPICH2 actually
>>> incorporates ptmalloc2 at the source-code level, rather than via dynamic
>>> shared-library loading.
>>>
>>>
>>> If this is the possible cause, how can it be fixed? Please let me know.
>>>
>>> Regards,
>>> Chaitra
>>>
>>>
>>>
>>>
>>> On Thu, Aug 7, 2014 at 6:58 PM, Hari Subramoni <subramoni.1 at osu.edu>
>>> wrote:
>>>
>>>> Hello Chaitra,
>>>>
>>>> From the backtrace, it looks like some memory corruption or an
>>>> out-of-memory condition. I do not think it is related to the MPI library.
>>>> Can you try running the application with valgrind or some other memory
>>>> checker to see if there is a memory overrun / leak in the Graph500 code?
>>>>
>>>> In the past, the following version of Graph500 worked fine for us.
>>>> Would it be possible for you to try this out as well?
>>>>
>>>> http://www.graph500.org/sites/default/files/files/graph500-1.2.tar.bz2
>>>>
>>>> Regards,
>>>> Hari.
>>>>
>>>>
>>>> On Thu, Aug 7, 2014 at 7:19 AM, Chaitra Kumar <chaitragkumar at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Hari,
>>>>>
>>>>> I had earlier compiled the code with gcc 4.8.2. Today I recompiled it
>>>>> with gcc 4.4.7 and tried running Graph500.
>>>>>
>>>>> The configuration I used was:
>>>>>
>>>>> ./configure --prefix=/home/padmanac/mvapich2 --enable-cxx
>>>>> --enable-threads=multiple --with-device=ch3:mrail --with-rdma=gen2
>>>>> --disable-fast --enable-g=all --enable-error-messages=all
>>>>>
>>>>> The latest stacktrace is as below:
>>>>> #0 0x00007f8ca65ca643 in free (mem=0x7f8bb5919010) at
>>>>> src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485
>>>>> #1 0x000000000040915c in bitmap::clear (this=<value optimized out>) at
>>>>> bitmap.cpp:54
>>>>> #2 0x0000000000411cbe in run_bfs (root_raw=4685000,
>>>>> pred=0x7f8bbd9210a8, settings=...) at bfs_custom.cpp:2036
>>>>> #3 0x00000000004032ca in main (argc=2, argv=0x7fff5e9ce0d8) at
>>>>> main.cpp:369
>>>>>
>>>>> When I rebuilt MVAPICH2 with '--disable-registration-cache', the trace was:
>>>>>
>>>>> #0  0x000000321f032925 in raise () from /lib64/libc.so.6
>>>>> #1  0x000000321f034105 in abort () from /lib64/libc.so.6
>>>>> #2  0x000000321f070837 in __libc_message () from /lib64/libc.so.6
>>>>> #3  0x000000321f076166 in malloc_printerr () from /lib64/libc.so.6
>>>>> #4  0x000000000040915c in bitmap::clear (this=<value optimized out>) at bitmap.cpp:54
>>>>> #5  0x0000000000411c64 in run_bfs (root_raw=4795152, pred=0x7fdfb570f0a8, settings=...) at bfs_custom.cpp:2032
>>>>> #6  0x00000000004032ca in main (argc=2, argv=0x7fff70f0c498) at main.cpp:369
>>>>>
>>>>> Please let me know if you need more information.
>>>>>
>>>>> Regards,
>>>>> Chaitra
>>>>>
>>>>>
>>>>> On Wed, Aug 6, 2014 at 11:33 PM, Chaitra Kumar <
>>>>> chaitragkumar at gmail.com> wrote:
>>>>>
>>>>>> Hi Hari,
>>>>>>
>>>>>> I followed the steps you specified.
>>>>>>
>>>>>> Execution still fails. The new trace is below:
>>>>>>
>>>>>> (gdb) bt
>>>>>> #0  0x0000003ea8a32925 in raise () from /lib64/libc.so.6
>>>>>> #1  0x0000003ea8a34105 in abort () from /lib64/libc.so.6
>>>>>> #2  0x0000003ea8a70837 in __libc_message () from /lib64/libc.so.6
>>>>>> #3  0x0000003ea8a76166 in malloc_printerr () from /lib64/libc.so.6
>>>>>> #4  0x00000000004054e5 in xfree ()
>>>>>> #5  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>> #6  0x0000000000415c6e in run_bfs(long, long*, bfs_settings const&) ()
>>>>>>     at bfs_custom.cpp:2032
>>>>>> #7  0x0000000000403a5d in main () at main.cpp:370
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Chaitra
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 6, 2014 at 6:47 PM, Hari Subramoni <subramoni.1 at osu.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Chaitra,
>>>>>>>
>>>>>>> Can you try rebuilding mvapich2 with the
>>>>>>> "--disable-registration-cache" configure flag?
>>>>>>>
>>>>>>> ./configure --disable-registration-cache <other options>; make
>>>>>>> clean; make -j 4; make install
>>>>>>>
>>>>>>> Once you've done this, please recompile your application and give it
>>>>>>> a shot.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Hari.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 6, 2014 at 1:47 AM, Chaitra Kumar <
>>>>>>> chaitragkumar at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Team,
>>>>>>>>
>>>>>>>> Even with the setting "-env MV2_USE_LAZY_MEM_UNREGISTER=0", there
>>>>>>>> is no change in the error or the trace.
>>>>>>>>
>>>>>>>> Pasting the backtrace again:
>>>>>>>> (gdb) bt
>>>>>>>> #0  0x00007f5cdd8012dc in _int_free ()
>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>> #1  0x00007f5cdd7ffa96 in free ()
>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>> #2  0x00000000004054e5 in xfree ()
>>>>>>>> #3  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>>>> #4  0x0000000000415cda in run_bfs(long, long*, bfs_settings const&)
>>>>>>>> ()
>>>>>>>>     at bfs_custom.cpp:2036
>>>>>>>> #5  0x0000000000403a5d in main () at main.cpp:370
>>>>>>>>
>>>>>>>> Any help is highly appreciated.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Chaitra
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 6, 2014 at 1:17 AM, Hari Subramoni <subramoni.1 at osu.edu
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hi Chaitra,
>>>>>>>>>
>>>>>>>>> Can you try running after setting "-env
>>>>>>>>> MV2_USE_LAZY_MEM_UNREGISTER=0"?
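>>>>>>>>>
>>>>>>>>> For example (illustrative only, keeping your other options
>>>>>>>>> unchanged), added to the launch command from your earlier mail:
>>>>>>>>>
>>>>>>>>>   mpiexec -f hostfile -np 50 -env MV2_USE_LAZY_MEM_UNREGISTER=0
>>>>>>>>>   ./graph500_mpi_custom_50 29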
>>>>>>>>>
>>>>>>>>> I'm cc'ing this note to our internal developer list. I would
>>>>>>>>> appreciate it if you could respond to this e-mail so that we can give
>>>>>>>>> feedback.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Hari.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Aug 5, 2014 at 3:13 PM, Chaitra Kumar <
>>>>>>>>> chaitragkumar at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Hari,
>>>>>>>>>>
>>>>>>>>>> Thanks for the quick reply.
>>>>>>>>>> I am using the tuned MPI implementation available on the Graph500 site
>>>>>>>>>> (http://www.graph500.org/referencecode). I haven't modified this code.
>>>>>>>>>>
>>>>>>>>>> Only some experiments throw a segmentation fault; other
>>>>>>>>>> experiments complete without any errors. For example:
>>>>>>>>>>
>>>>>>>>>> The following command fails with segmentation fault:
>>>>>>>>>>
>>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited
>>>>>>>>>> -env MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>>> ./graph500_mpi_custom_50 29
>>>>>>>>>>
>>>>>>>>>> whereas if I change the scale to 28 or 30, the same code works
>>>>>>>>>> without any error:
>>>>>>>>>>
>>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited
>>>>>>>>>> -env MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>>> ./graph500_mpi_custom_50 30
>>>>>>>>>>
>>>>>>>>>> Please let me know if you want me to run with some other options.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Chaitra
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 6, 2014 at 12:20 AM, Hari Subramoni <
>>>>>>>>>> subramoni.1 at osu.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Chaitra,
>>>>>>>>>>>
>>>>>>>>>>> From the backtrace it seems that the failure (possibly a double
>>>>>>>>>>> free) is happening in the application code. You mentioned that the Graph500
>>>>>>>>>>> is a tuned version. Does this mean that you have made local code changes to
>>>>>>>>>>> it? If so, could you try using an unmodified version of Graph500 and
>>>>>>>>>>> see if the same failure happens?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Hari.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 5, 2014 at 1:57 PM, Chaitra Kumar <
>>>>>>>>>>> chaitragkumar at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to run Graph500 (tuned MPI version) on
>>>>>>>>>>>> MVAPICH2-2.0. The machine has InfiniBand.
>>>>>>>>>>>>
>>>>>>>>>>>> I am using the following configuration to build MVAPICH2 (I have
>>>>>>>>>>>> enabled the debugging options):
>>>>>>>>>>>>
>>>>>>>>>>>> ./configure --prefix=/home/padmanac/mvapich2 --enable-cxx
>>>>>>>>>>>> --enable-threads=multiple --with-device=ch3:mrail --with-rdma=gen2
>>>>>>>>>>>> --disable-fast --enable-g=all --enable-error-messages=all
>>>>>>>>>>>>
>>>>>>>>>>>> The command I am using to launch Graph500 is:
>>>>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited
>>>>>>>>>>>> -env MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>>>>> ./graph500_mpi_custom_50 29
>>>>>>>>>>>>
>>>>>>>>>>>> This command always results in a segmentation fault. From the
>>>>>>>>>>>> core dump I obtained the backtrace.
>>>>>>>>>>>>
>>>>>>>>>>>> Please find the trace below:
>>>>>>>>>>>>
>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>> #0  0x00007f38f84b52dc in _int_free ()
>>>>>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>>>>>> #1  0x00007f38f84b3a96 in free ()
>>>>>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>>>>>> #2  0x00000000004054e5 in xfree ()
>>>>>>>>>>>> #3  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>>>>>>>> #4  0x0000000000415cda in run_bfs(long, long*, bfs_settings
>>>>>>>>>>>> const&) ()
>>>>>>>>>>>>     at bfs_custom.cpp:2036
>>>>>>>>>>>> #5  0x0000000000403a5d in main () at main.cpp:370
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Please let me know how to solve this. Am I missing some
>>>>>>>>>>>> configuration flag?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Chaitra
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> -Akshay
>