[mvapich-discuss] mvapich2 + MALLOC_CHECK_

Dan Kokron daniel.kokron at nasa.gov
Thu Jun 3 13:40:39 EDT 2010


I originally enabled the MALLOC_CHECK_ feature in order to investigate a
failure during MPI_Finalize.  Setting it to 1 should allow the program
to proceed, but won't provide much new information regarding the
finalize failure.
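For anyone who wants to poke at this without building MVAPICH2, here is a
minimal standalone sketch of the pattern that mem_hooks.c:86 exercises (the
file name and build line below are mine, not from the MVAPICH2 tree).
Whether the free of the valloc'd block actually aborts depends on which
ptmalloc answers the call; with MVAPICH2's bundled ptmalloc2 and
MALLOC_CHECK_=2 it dies in free_check(), per the trace quoted below.

/* check_valloc.c: standalone sketch of the allocation pattern that
 * MVAPICH2's mvapich2_minit() exercises -- calloc/valloc/memalign a
 * block each and free them with malloc checking enabled.  Against
 * the bundled ptmalloc2 with MALLOC_CHECK_=2, the free() of the
 * valloc'd block aborts in free_check(); stock glibc may differ.
 *
 * Build and run (illustrative):
 *   icc -o check_valloc check_valloc.c
 *   MALLOC_CHECK_=2 ./check_valloc
 */
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>             /* valloc(), memalign() */

int main(void)
{
    void *ptr_calloc   = calloc(1, 64);
    void *ptr_valloc   = valloc(64);      /* page-aligned block */
    void *ptr_memalign = memalign(16, 64);

    free(ptr_calloc);
    free(ptr_valloc);           /* <-- the free that free_check() flags */
    free(ptr_memalign);

    puts("all three frees survived MALLOC_CHECK_");
    return 0;
}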

I guess I'm stuck in the middle here.  The application I am debugging is
pure MPI, so ptmalloc3's slightly lower memory efficiency in
multi-threaded applications isn't a problem for me.  Can you estimate
how much work it would be for me to swap out ptmalloc2 for ptmalloc3 in
my mvapich2 sandbox?

Dan

p.s.
Just FYI, I looked at the ptmalloc3 code.  It appears to be based on a
pre-release version of Doug Lea's malloc-2.8.4.

p.p.s.
While we are on the topic of memory allocators, has your group looked
into using HOARD? http://www.hoard.org/
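
If it helps frame the question: Hoard is a drop-in replacement, so it
interposes on malloc/free via LD_PRELOAD rather than being compiled in.  A
toy interposer (entirely my own illustration, not Hoard code) looks like
this; a real allocator supplies its own heap where this one just forwards
to glibc:

/* count_malloc.c: toy illustration of the LD_PRELOAD interposition
 * mechanism that drop-in allocators like Hoard rely on.  Defining
 * malloc() in a preloaded shared object makes the dynamic linker
 * resolve calls here instead of in libc.  This wrapper only counts
 * calls and forwards to glibc's __libc_malloc(); it is not an
 * allocator itself.
 *
 *   gcc -shared -fPIC -o count_malloc.so count_malloc.c
 *   LD_PRELOAD=./count_malloc.so ./a.out
 */
#include <stddef.h>
#include <stdio.h>

extern void *__libc_malloc(size_t);   /* glibc's internal entry point */

static unsigned long ncalls;

void *malloc(size_t n)
{
    ncalls++;                          /* not thread-safe; sketch only */
    return __libc_malloc(n);
}

__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "malloc() was called %lu times\n", ncalls);
}

I realize the wrinkle is that MVAPICH2's bundled ptmalloc2 already
occupies that interposition slot for registration-cache safety, so Hoard
and the library would be competing for the same hooks.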

p.p.p.s.
The finalize failure has the following trace

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libibverbs.so.1    00002B527AA70F2E  Unknown               Unknown  Unknown
GEOSgcm.x          000000000A9AB4F2  MPIDI_CH3I_CM_Fin        1501  rdma_iba_init.c
GEOSgcm.x          000000000A9A1412  MPIDI_CH3_Finaliz          57  ch3_finalize.c
GEOSgcm.x          000000000A962CA9  MPID_Finalize             170  mpid_finalize.c
GEOSgcm.x          000000000A8A68BC  PMPI_Finalize             168  finalize.c
GEOSgcm.x          000000000A806EAF  MPI_Finalize             1534  TauMpi.c
GEOSgcm.x          0000000008C2BBFA  _ZN5ESMCI3VMK8fin         470  ESMCI_VMKernel.C
GEOSgcm.x          0000000008C3E07D  _ZN5ESMCI2VM8fina        1459  ESMCI_VM.C
GEOSgcm.x          0000000009D39246  c_esmc_vmfinalize         826  ESMCI_VM_F.C
GEOSgcm.x          0000000009AF7940  esmf_vmmod_mp_esm        6152  ESMF_VM.F90
GEOSgcm.x          00000000095C66A0  esmf_initmod_mp_e         513  ESMF_Init.F90
GEOSgcm.x          0000000008694D35  mapl_capmod_mp_ma         618  MAPL_Cap.pp.inst.F90
GEOSgcm.x          00000000004A4B75  MAIN__                    171  GEOSgcm.pp.inst.F90


On Wed, 2010-06-02 at 19:07 -0500, Sayantan Sur wrote:
> Hi Dan,
> 
> I looked into this issue. It appears to be a bug in ptmalloc2.
> MVAPICH/MVAPICH2 uses the ptmalloc2 implementation of malloc to provide
> safe registration/de-registration, and the malloc checking for memory
> allocated through 'valloc' seems to be buggy.
> 
> This bug seems to have been fixed in ptmalloc3. The last time we
> looked into upgrading to ptmalloc3, we saw this message on ptmalloc's
> website: "In multi-thread Applications, ptmalloc2 is currently
> slightly more memory-efficient than ptmalloc3."
> [http://www.malloc.de/en/] We decided not to upgrade to ptmalloc3.
> 
> If you use MALLOC_CHECK_=1, then you will get a warning, but your
> program will proceed. Presumably, you chose to use this checking to
> find bugs in your MPI program? Maybe you can overlook this one warning
> for now and let us know how it works. We will also investigate
> ptmalloc3 and plan to incorporate it in a future release.
> 
> Thanks.
> 
> On Wed, Jun 2, 2010 at 3:47 PM, Sayantan Sur <surs at cse.ohio-state.edu> wrote:
> > Hi Dan,
> >
> > Thanks for reporting this. I don't think anyone has reported this
> > earlier. I was able to reproduce it on our systems, and am currently
> > looking into this issue.
> >
> > Thanks.
> >
> > On Tue, Jun 1, 2010 at 6:25 PM, Dan Kokron <daniel.kokron at nasa.gov> wrote:
> >> I am attempting to debug an application that fails during MPI_Finalize.
> >> After trying the usual debugging options (-g etc), I set MALLOC_CHECK_=2
> >> to see what would happen.  It now fails with the following trace during
> >> MPI_Init.  I didn't see any mention of this issue in the archives.
> >> Maybe I missed it.
> >>
> >> #0  0x00000000052e5bb5 in raise () from /lib64/libc.so.6
> >> #1  0x00000000052e6fb0 in abort () from /lib64/libc.so.6
> >> #2  0x00000000005718f9 in for__signal_handler ()
> >> #3  <signal handler called>
> >> #4  0x00000000052e5bb5 in raise () from /lib64/libc.so.6
> >> #5  0x00000000052e6fb0 in abort () from /lib64/libc.so.6
> >> #6  0x0000000000412126 in free_check (mem=0x4138000, caller=0x0) at hooks.c:274
> >> #7  0x000000000041480a in free (mem=0x4138000) at mvapich_malloc.c:3443
> >> #8  0x00000000004180ce in mvapich2_minit () at mem_hooks.c:86
> >> #9  0x00000000005526a8 in MPIDI_CH3I_RDMA_init (pg=0x411f618, pg_rank=21) at rdma_iba_init.c:153
> >> #10 0x000000000054d148 in MPIDI_CH3_Init (has_parent=0, pg=0x411f618, pg_rank=21) at ch3_init.c:161
> >> #11 0x00000000004d9cce in MPID_Init (argc=0x0, argv=0x0, requested=0, provided=0x7feffba78, has_args=0x7feffba80, has_env=0x7feffba7c) at mpid_init.c:189
> >> #12 0x0000000000435780 in MPIR_Init_thread (argc=0x0, argv=0x0, required=0, provided=0x0) at initthread.c:305
> >> #13 0x0000000000434582 in PMPI_Init (argc=0x0, argv=0x0) at init.c:135
> >> #14 0x0000000000410e0f in pmpi_init_ (ierr=0x7feffe774) at initf.c:129
> >> #15 0x000000000040bdbf in gcrm_test_io () at gcrm_test_io.f90:27
> >> #16 0x000000000040bcdc in main ()
> >>
> >> Valgrind-3.5.0 gives the following
> >>
> >> ==21574== Conditional jump or move depends on uninitialised value(s)
> >> ==21574==    at 0x41182C: mem2chunk_check (hooks.c:165)
> >> ==21574==    by 0x4120C3: free_check (hooks.c:268)
> >> ==21574==    by 0x414809: free (mvapich_malloc.c:3443)
> >> ==21574==    by 0x4180CD: mvapich2_minit (mem_hooks.c:86)
> >> ==21574==    by 0x5526A7: MPIDI_CH3I_RDMA_init (rdma_iba_init.c:153)
> >> ==21574==    by 0x54D147: MPIDI_CH3_Init (ch3_init.c:161)
> >> ==21574==    by 0x4D9CCD: MPID_Init (mpid_init.c:189)
> >> ==21574==    by 0x43577F: MPIR_Init_thread (initthread.c:305)
> >> ==21574==    by 0x434581: PMPI_Init (init.c:135)
> >> ==21574==    by 0x410E0E: mpi_init_ (initf.c:129)
> >> ==21574==    by 0x40BDBE: MAIN__ (gcrm_test_io.f90:27)
> >> ==21574==    by 0x40BCDB: main (in /gpfsm/dhome/dkokron/play/mpi-io/gcrm_test_io.x)
> >> ==21574==  Uninitialised value was created
> >> ==21574==    at 0x536FC7A: brk (in /lib64/libc-2.4.so)
> >> ==21574==    by 0x536FD41: sbrk (in /lib64/libc-2.4.so)
> >> ==21574==    by 0x418251: mvapich2_sbrk (mem_hooks.c:148)
> >> ==21574==    by 0x414058: sYSMALLOc (mvapich_malloc.c:2983)
> >> ==21574==    by 0x41647E: _int_malloc (mvapich_malloc.c:4318)
> >> ==21574==    by 0x411FE8: malloc_check (hooks.c:252)
> >> ==21574==    by 0x414607: malloc (mvapich_malloc.c:3395)
> >> ==21574==    by 0x4113AA: malloc_hook_ini (hooks.c:28)
> >> ==21574==    by 0x414607: malloc (mvapich_malloc.c:3395)
> >> ==21574==    by 0x57E382: for__get_vm (in /gpfsm/dhome/dkokron/play/mpi-io/gcrm_test_io.x)
> >> ==21574==    by 0x5722B2: for_rtl_init_ (in /gpfsm/dhome/dkokron/play/mpi-io/gcrm_test_io.x)
> >> ==21574==    by 0x40BCD6: main (in /gpfsm/dhome/dkokron/play/mpi-io/gcrm_test_io.x)
> >>
> >> I am using mvapich2-1.4-2010-05-25 configured as follows
> >>
> >> ./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS="-DRDMA_CM -fpic
> >> -O0 -traceback -debug" CXXFLAGS="-DRDMA_CM -fpic -O0 -traceback -debug"
> >> FFLAGS="-fpic -O0 -traceback -debug -nolib-inline -check bounds -check
> >> uninit -fp-stack-check -ftrapuv" F90FLAGS="-fpic -O0 -traceback -debug
> >> -nolib-inline -check bounds -check uninit -fp-stack-check -ftrapuv"
> >> --prefix=/discover/nobackup/dkokron/mv2-1.4.1_debug
> >> --enable-error-checking=all --enable-error-messages=all --enable-g=all
> >> --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio
> >> --enable-threads=multiple --with-rdma=gen2
> >>
> >> on Linux 2.6.16.60-0.42.5-smp
> >>
> >> with the Intel compilers (v 11.0.083)
> >>
> >> Note that line 86 in my copy of mem_hooks.c is the following (I
> >> added some debug prints, so the line numbers are shifted):
> >>
> >>    free(ptr_calloc);
> >> --->free(ptr_valloc);  <---
> >>    free(ptr_memalign);
> >>
> >> --
> >> Dan Kokron
> >> Global Modeling and Assimilation Office
> >> NASA Goddard Space Flight Center
> >> Greenbelt, MD 20771
> >> Daniel.S.Kokron at nasa.gov
> >> Phone: (301) 614-5192
> >> Fax:   (301) 614-5304
> >>
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >>
> >
> >
> >
> > --
> > Sayantan Sur
> >
> > Research Scientist
> > Department of Computer Science
> > The Ohio State University.
> >
> 
> 
> 
-- 
Dan Kokron
Global Modeling and Assimilation Office
NASA Goddard Space Flight Center
Greenbelt, MD 20771
Daniel.S.Kokron at nasa.gov
Phone: (301) 614-5192
Fax:   (301) 614-5304


