[mvapich-discuss] Segmentation fault while running application

Chaitra Kumar chaitragkumar at gmail.com
Mon Aug 18 05:08:08 EDT 2014


Hi Hari and Team,

To rule out any issues with the existing libraries, we reimaged the OS and
re-installed all RDMA-related drivers. The problem still persists.

I also ran the application using valgrind.

The command I used was:

mpirun_rsh -np 72 -hostfile hostfile  MV2_DEBUG_CORESIZE=unlimited
MV2_DEBUG_SHOW_BACKTRACE=1  MV2_ENABLE_AFFINITY=0 valgrind --tool=memcheck
--leak-check=full --track-origins=yes --show-reachable=yes
./graph500_mpi_custom_72 28

It generated 72 core files.


Below is the backtrace generated for one of the processes:

#0  0x000000321f80f5db in raise () from /lib64/libpthread.so.0
#1  <signal handler called>
#2  0x000000000574b6ea in MPL_trfree ()
   from /home/padmanac/mvapich-gdb/lib/libmpl.so.1
#3  0x0000000004ed747c in MPIU_trfree (a_ptr=0xf0e0d0c9, line=4139,
    fname=0x5292bdc "src/mpi/coll/ch3_shmem_coll.c") at
src/util/mem/trmem.c:37
#4  0x0000000005032868 in mv2_shm_coll_cleanup (shmem=0x1621cc28)
    at src/mpi/coll/ch3_shmem_coll.c:4139
#5  0x000000000517729e in free_2level_comm (comm_ptr=0x10697c18)
    at src/mpi/comm/create_2level_comm.c:144
#6  0x0000000004eca438 in MPIR_Comm_delete_internal (comm_ptr=0x10697c18,
    isDisconnect=0) at src/mpi/comm/commutil.c:1918
#7  0x000000000516a68e in MPIR_Comm_release (comm_ptr=0x10697c18,
    isDisconnect=0) at ./src/include/mpiimpl.h:1331
#8  0x000000000516a9f7 in PMPI_Comm_free (comm=0x7feffebb4)
    at src/mpi/comm/comm_free.c:124
#9  0x0000000000408f8d in scatter_bitmap_set::~scatter_bitmap_set (
    this=0x7feffeba0, __in_chrg=<value optimized out>) at onesided.hpp:271
#10 0x0000000000406b7c in validate_bfs_result (tg=0x7fefff1a0,
    nglobalverts=268435456, nlocalverts=4194304, root=31958113,
    pred=0xa73a10a8, edge_visit_count_ptr=0x7fefff118) at validate.cpp:449
#11 0x0000000000403737 in main (argc=2, argv=0x7fefff498) at main.cpp:381

Valgrind generated the following error log:
[polaris-1:mpi_rank_36][error_sighandler] Caught error: Segmentation fault (signal 11)
==131765== Invalid read of size 8
==131765==    at 0x574B6EA: MPL_trfree (in /home/padmanac/mvapich-gdb/lib/libmpl.so.1.0.0)
==131765==    by 0x4ED747B: MPIU_trfree (trmem.c:37)
==131765==    by 0x5032867: mv2_shm_coll_cleanup (ch3_shmem_coll.c:4139)
==131765==    by 0x517729D: free_2level_comm (create_2level_comm.c:144)
==131765==    by 0x4ECA437: MPIR_Comm_delete_internal (commutil.c:1918)
==131765==    by 0x516A68D: MPIR_Comm_release.clone.0 (mpiimpl.h:1331)
==131765==    by 0x516A9F6: PMPI_Comm_free (comm_free.c:124)
==131765==    by 0x408F8C: scatter_bitmap_set::~scatter_bitmap_set() (onesided.hpp:271)
==131765==    by 0x406B7B: validate_bfs_result(tuple_graph const*, long, unsigned long, long, long*, long*) (validate.cpp:449)
==131765==    by 0x403736: main (main.cpp:381)
==131765==  Address 0xf0e0d0b9 is not stack'd, malloc'd or (recently) free'd
==131765==


When I run the same experiment with Open MPI, it finishes without any error.

Any pointers to fix this are highly appreciated.

Thanks for your help.

Regards,
Chaitra



On Fri, Aug 8, 2014 at 12:42 AM, Chaitra Kumar <chaitragkumar at gmail.com>
wrote:

> Hi Hari,
>
> Thanks for your reply. I will try to run the application with valgrind.
> We have a requirement to use the tuned MPI version of Graph500. Moreover,
> this implementation works perfectly fine for many combinations of scale
> and process count; it crashes only in some cases.
>
>
> Based on the trace I got earlier, please let me know if the following is a
> possible cause of the runtime error:
>
> #0 0x00007f8ca65ca643 in free (mem=0x7f8bb5919010) at
> src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485
> #1 0x000000000040915c in bitmap::clear (this=<value optimized out>) at
> bitmap.cpp:54
> #2 0x0000000000411cbe in run_bfs (root_raw=4685000, pred=0x7f8bbd9210a8,
> settings=...) at bfs_custom.cpp:2036
> #3 0x00000000004032ca in main (argc=2, argv=0x7fff5e9ce0d8) at main.cpp:369
>
> -------------------------------------------------------------
>
> It is possible that when the bitmap object is created by the MPI
> application (that is, Graph500), the memory is allocated by the default
> C++ allocator. However, at the crash point, the "free" that is invoked
> comes from the external memory allocator "ptmalloc2", as indicated by the
> trace entry
> "src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485".
> Different memory allocators use different boundary guards (signatures).
> When this "free" is invoked, it finds that the guard is not what it would
> be had the memory come from the matching ptmalloc2 "malloc", declares
> memory corruption, and the application crashes.
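>
> As a minimal sketch of the kind of mismatch described above (purely
> illustrative; this code is not taken from the Graph500 or MVAPICH sources),
> memory obtained from one allocator is handed to a different allocator's
> free(), which then misreads the boundary tags it expects:
>
> #include <cstdlib>
>
> int main() {
>     // Allocated through the application's allocator (here, operator new[]) ...
>     long *bits = new long[1024];
>     // ... but released through a different allocator's free(). When an MPI
>     // library interposes its own malloc/free (e.g. ptmalloc2), that free()
>     // checks its own boundary tags, finds something unexpected, and reports
>     // corruption or crashes.
>     std::free(bits);   // mismatched allocate/release: undefined behaviour
>     return 0;
> }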
>
>
>
> In summary, this is likely a memory allocator consistency problem: the MPI
> application uses one memory allocator to "malloc", and the MPI runtime uses
> a different memory allocator to "free". So we need to find out why the MPI
> runtime ends up freeing memory allocated by the MPI application. Should the
> data be copied across the MPI application / MPI runtime boundary?
>
> The typical way to have the entire application stack and middleware runtime
> use a common memory allocator is to package the allocator as a separate
> shared library and have the application link against it dynamically.
> Further, in the shell environment of the user who invokes the application,
> export LD_LIBRARY_PATH to include the path where this allocator's shared
> library is located, so that the loader picks up the external allocator and
> the malloc() and free() entry points resolve to the ones it provides. Such
> an approach is described for Google's TCMalloc:
> http://goog-perftools.sourceforge.net/doc/tcmalloc.html.
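>
> For example, as a rough sketch only (the TCMalloc library path below is
> hypothetical and depends on where it is installed on your system), one
> could preload the allocator and pass it through the launcher the same way
> the MV2_* variables are passed, e.g.
>
> mpirun_rsh -np 72 -hostfile hostfile LD_PRELOAD=/usr/lib64/libtcmalloc.so ./graph500_mpi_custom_72 28
>
> assuming the launcher exports LD_PRELOAD to the remote processes, so that
> the dynamic loader resolves malloc()/free() to TCMalloc for both the
> application and the MPI library. Whether this actually overrides MVAPICH's
> built-in ptmalloc2 would still need to be verified, given the observation
> below that it is incorporated at the source level.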
>
>
>
> But from the tracing information, it seems that MVAPICH actually
> incorporates ptmalloc2 at the source code level, rather than via dynamic
> shared library loading.
>
>
> If this is the likely cause, how can it be fixed?  Please let me know.
>
> Regards,
> Chaitra
>
>
>
>
> On Thu, Aug 7, 2014 at 6:58 PM, Hari Subramoni <subramoni.1 at osu.edu>
> wrote:
>
>> Hello Chaitra,
>>
>> From the backtrace, it looks like some memory corruption or an out-of-memory
>> condition. I do not think it is related to the MPI library. Can you try
>> running the application with valgrind or some other memory checker to see
>> if there is a memory overrun or leak in the Graph500 code?
>>
>> In the past, the following version of Graph500 worked fine for us. Would
>> it be possible for you to try this out as well?
>>
>> http://www.graph500.org/sites/default/files/files/graph500-1.2.tar.bz2
>>
>> Regards,
>> Hari.
>>
>>
>> On Thu, Aug 7, 2014 at 7:19 AM, Chaitra Kumar <chaitragkumar at gmail.com>
>> wrote:
>>
>>> Hi Hari,
>>>
>>> I had earlier compiled the code with gcc 4.8.2.  Today I recompiled it
>>> with gcc 4.4.7 and tried running Graph500.
>>>
>>> The configuration I used was:
>>>
>>> ./configure --prefix=/home/padmanac/mvapich2 --enable-cxx
>>> --enable-threads=multiple --with-device=ch3:mrail --with-rdma=gen2
>>> --disable-fast --enable-g=all --enable-error-messages=all
>>>
>>> The latest stacktrace is as below:
>>> #0 0x00007f8ca65ca643 in free (mem=0x7f8bb5919010) at
>>> src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3485
>>> #1 0x000000000040915c in bitmap::clear (this=<value optimized out>) at
>>> bitmap.cpp:54
>>> #2 0x0000000000411cbe in run_bfs (root_raw=4685000, pred=0x7f8bbd9210a8,
>>> settings=...) at bfs_custom.cpp:2036
>>> #3 0x00000000004032ca in main (argc=2, argv=0x7fff5e9ce0d8) at
>>> main.cpp:369
>>>
>>> When I rebuilt MVAPICH2 with '--disable-registration-cache', the trace was:
>>>
>>> #0  0x000000321f032925 in raise () from /lib64/libc.so.6
>>> #1  0x000000321f034105 in abort () from /lib64/libc.so.6
>>> #2  0x000000321f070837 in __libc_message () from /lib64/libc.so.6
>>> #3  0x000000321f076166 in malloc_printerr () from /lib64/libc.so.6
>>> #4  0x000000000040915c in bitmap::clear (this=<value optimized out>) at bitmap.cpp:54
>>> #5  0x0000000000411c64 in run_bfs (root_raw=4795152, pred=0x7fdfb570f0a8, settings=...) at bfs_custom.cpp:2032
>>> #6  0x00000000004032ca in main (argc=2, argv=0x7fff70f0c498) at main.cpp:369
>>>
>>> Please let me know if you need more information.
>>>
>>> Regards,
>>> Chaitra
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 6, 2014 at 11:33 PM, Chaitra Kumar <chaitragkumar at gmail.com>
>>> wrote:
>>>
>>>> Hi Hari,
>>>>
>>>> I followed the steps you specified.
>>>>
>>>> The execution still fails. The new trace is below:
>>>>
>>>> (gdb) bt
>>>> #0  0x0000003ea8a32925 in raise () from /lib64/libc.so.6
>>>> #1  0x0000003ea8a34105 in abort () from /lib64/libc.so.6
>>>> #2  0x0000003ea8a70837 in __libc_message () from /lib64/libc.so.6
>>>> #3  0x0000003ea8a76166 in malloc_printerr () from /lib64/libc.so.6
>>>> #4  0x00000000004054e5 in xfree ()
>>>> #5  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>> #6  0x0000000000415c6e in run_bfs(long, long*, bfs_settings const&) ()
>>>>     at bfs_custom.cpp:2032
>>>> #7  0x0000000000403a5d in main () at main.cpp:370
>>>>
>>>>
>>>> Regards,
>>>> Chaitra
>>>>
>>>>
>>>> On Wed, Aug 6, 2014 at 6:47 PM, Hari Subramoni <subramoni.1 at osu.edu>
>>>> wrote:
>>>>
>>>>> Hello Chaitra,
>>>>>
>>>>> Can you try rebuilding mvapich2 with the
>>>>> "--disable-registration-cache" configure flag?
>>>>>
>>>>> ./configure --disable-registration-cache <other options>; make clean;
>>>>> make -j 4; make install
>>>>>
>>>>> Once you've done this, please recompile your application and give it a
>>>>> shot.
>>>>>
>>>>> Regards,
>>>>> Hari.
>>>>>
>>>>>
>>>>> On Wed, Aug 6, 2014 at 1:47 AM, Chaitra Kumar <chaitragkumar at gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> Even with the setting "-env MV2_USE_LAZY_MEM_UNREGISTER=0", there is
>>>>>> no change in the error or trace.
>>>>>>
>>>>>> Pasting the backtrace again:
>>>>>> (gdb) bt
>>>>>> #0  0x00007f5cdd8012dc in _int_free ()
>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>> #1  0x00007f5cdd7ffa96 in free ()
>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>> #2  0x00000000004054e5 in xfree ()
>>>>>> #3  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>> #4  0x0000000000415cda in run_bfs(long, long*, bfs_settings const&) ()
>>>>>>     at bfs_custom.cpp:2036
>>>>>> #5  0x0000000000403a5d in main () at main.cpp:370
>>>>>>
>>>>>> Any help is highly appreciated.
>>>>>>
>>>>>> Regards,
>>>>>> Chaitra
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 6, 2014 at 1:17 AM, Hari Subramoni <subramoni.1 at osu.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Chaitra,
>>>>>>>
>>>>>>> Can you try running after setting "-env
>>>>>>> MV2_USE_LAZY_MEM_UNREGISTER=0"?
>>>>>>>
>>>>>>> I'm cc'ing this note to our internal developer list. I would
>>>>>>> appreciate it if you could respond to this e-mail so that we can give
>>>>>>> feedback.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Hari.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 5, 2014 at 3:13 PM, Chaitra Kumar <
>>>>>>> chaitragkumar at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Hari,
>>>>>>>>
>>>>>>>> Thanks for the quick reply.
>>>>>>>> I am using the tuned MPI implementation available on the Graph500 site
>>>>>>>> (http://www.graph500.org/referencecode). I haven't modified this code.
>>>>>>>>
>>>>>>>> Only some experiments throw a segmentation fault; other experiments
>>>>>>>> complete without any errors. For example,
>>>>>>>>
>>>>>>>> The following command fails with segmentation fault:
>>>>>>>>
>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited -env
>>>>>>>> MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>> ./graph500_mpi_custom_50 29
>>>>>>>>
>>>>>>>> whereas if I change the scale to 28 or 30, the same code works
>>>>>>>> without any error:
>>>>>>>>
>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited -env
>>>>>>>> MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>> ./graph500_mpi_custom_50 30
>>>>>>>>
>>>>>>>> Please let me know if you want me to run with some other options.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Chaitra
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 6, 2014 at 12:20 AM, Hari Subramoni <
>>>>>>>> subramoni.1 at osu.edu> wrote:
>>>>>>>>
>>>>>>>>> Hello Chaitra,
>>>>>>>>>
>>>>>>>>> From the backtrace it seems that the failure (possibly a double
>>>>>>>>> free) is happening in the application code. You mentioned that the Graph500
>>>>>>>>> is a tuned version. Does this mean that you have made local code changes to
>>>>>>>>> it? If so, could you try using an unmodified version of Graph500 and
>>>>>>>>> see if the same failure happens?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Hari.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Aug 5, 2014 at 1:57 PM, Chaitra Kumar <
>>>>>>>>> chaitragkumar at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Team,
>>>>>>>>>>
>>>>>>>>>> I am trying to run Graph500 (tuned MPI version) on MVAPICH2-2.0.
>>>>>>>>>> The machine has InfiniBand.
>>>>>>>>>>
>>>>>>>>>> I am using the following configuration to build MVAPICH2 (with
>>>>>>>>>> debugging options enabled):
>>>>>>>>>>
>>>>>>>>>> ./configure --prefix=/home/padmanac/mvapich2 --enable-cxx
>>>>>>>>>> --enable-threads=multiple --with-device=ch3:mrail --with-rdma=gen2
>>>>>>>>>> --disable-fast --enable-g=all --enable-error-messages=all
>>>>>>>>>>
>>>>>>>>>> The command I am using to launch Graph500 is:
>>>>>>>>>> mpiexec -f hostfile -np 50  -env MV2_DEBUG_CORESIZE=unlimited
>>>>>>>>>> -env MV2_DEBUG_SHOW_BACKTRACE=1  -env MV2_ENABLE_AFFINITY=0
>>>>>>>>>> ./graph500_mpi_custom_50 29
>>>>>>>>>>
>>>>>>>>>> This command always results in a segmentation fault. From the
>>>>>>>>>> core dump I have got the backtrace.
>>>>>>>>>>
>>>>>>>>>> Please find the trace below:
>>>>>>>>>>
>>>>>>>>>> (gdb) bt
>>>>>>>>>> #0  0x00007f38f84b52dc in _int_free ()
>>>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>>>> #1  0x00007f38f84b3a96 in free ()
>>>>>>>>>>    from /home/padmanac/mvapich2/lib/libmpich.so.12
>>>>>>>>>> #2  0x00000000004054e5 in xfree ()
>>>>>>>>>> #3  0x000000000040ca10 in bitmap::clear() () at bitmap.cpp:54
>>>>>>>>>> #4  0x0000000000415cda in run_bfs(long, long*, bfs_settings
>>>>>>>>>> const&) ()
>>>>>>>>>>     at bfs_custom.cpp:2036
>>>>>>>>>> #5  0x0000000000403a5d in main () at main.cpp:370
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Please let me know how to solve this. Am I missing some
>>>>>>>>>> configuration flag?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Chaitra
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> mvapich-discuss mailing list
>>>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>