[mvapich-discuss] waitsome/testsome memory allocation
    Justin
    luitjens at cs.utah.edu
    Wed Oct 10 00:14:10 EDT 2007

Hi,
The relevant stack traces for these allocations are the following:
1. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Thread.so [0x2a95c42284]
2. /lib64/tls/libc.so.6 [0x2a9a7152b0]
3. /lib64/tls/libc.so.6(gsignal+0x3d) [0x2a9a71521d]
4. /lib64/tls/libc.so.6(abort+0xfe) [0x2a9a716a1e]
5. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6
 in __gnu_cxx::__verbose_terminate_handler()
6. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 
[0x2a9a499076]
7. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 
[0x2a9a4990a3]
8. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 
[0x2a9a4990b6]
9. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6(__cxa_call_unexpected+0x48) 
[0x2a9a498fc8]
a. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Malloc.so(malloc+0x63) 
[0x2a980c92ff]
b. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBiAllocate+0x39) 
[0x2a98bdce39]
c. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBalloc+0x2b) 
[0x2a98bdcf8b]
d. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_Msg_arrived+0xe3) 
[0x2a98bda2b3]
e. 
/g/g20/luitjens/mpi//lib/libmpich.so.1.0(viadev_incoming_eager_start+0x43) 
[0x2a98be8753]
f. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(viadev_process_recv+0x2ef) 
[0x2a98be9b6f]
10. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_DeviceCheck+0xde) 
[0x2a98bea77e]
11. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPI_Testsome+0x45) 
[0x2a98be1b35]
And
1. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Thread.so [0x2a95c42284]
2. /lib64/tls/libc.so.6 [0x2a9a7152b0]
3. /lib64/tls/libc.so.6(gsignal+0x3d) [0x2a9a71521d]
4. /lib64/tls/libc.so.6(abort+0xfe) [0x2a9a716a1e]
5. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6
  in __gnu_cxx::__verbose_terminate_handler()
6. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 
[0x2a9a499076]
7. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 
[0x2a9a4990a3]
8. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 
[0x2a9a4990b6]
9. 
/usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6(__cxa_call_unexpected+0x48) 
[0x2a9a498fc8]
a. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Malloc.so(malloc+0x63) 
[0x2a980c92ff]
b. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBiAllocate+0x39) 
[0x2a98bdce39]
c. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBalloc+0x2b) 
[0x2a98bdcf8b]
d. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_Msg_arrived+0xe3) 
[0x2a98bda2b3]
e. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(smpi_net_lookup+0xc24) 
[0x2a98bd3bd4]
f. 
/g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SMP_Check_incoming+0x2d5) 
[0x2a98bd4ee5]
10. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_DeviceCheck+0x185) 
[0x2a98bea825]
11. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPI_Testsome+0x45) 
[0x2a98be1b35]
I have tried turning off the ELAN optimizations and the allocations still 
occur.  The commonality in the stack traces appears to be the calls to 
MPID_SBalloc.  Is it possible that there is a leak in the MPI library that 
we are running into?  When is the memory allocated in this function freed?  
If the same communication pattern is occurring over and over, what would 
cause this function to keep allocating memory instead of reusing the 
memory that has already been allocated?
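
To make the question concrete, here is a minimal sketch of how I would 
expect a fixed-size block pool to behave.  It is not taken from the 
MVAPICH or MPICH source, and I do not know that MPID_SBalloc works this 
way.  The names pool_t, pool_alloc, pool_free and simulate, and the chunk 
size of 4 blocks, are all made up for illustration.  The point is only 
that such a pool calls malloc again when its free list is empty, so 
repeated growth would suggest that blocks are not being returned between 
iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCKS_PER_CHUNK 4   /* hypothetical growth increment */
#define MAX_PER_STEP     16  /* hypothetical cap on blocks taken per step */

typedef struct block {
    struct block *next;      /* link used while the block is on the free list */
    char payload[64];
} block_t;

typedef struct {
    block_t *free_list;      /* blocks available for reuse */
    size_t   chunks;         /* number of malloc calls made so far */
} pool_t;

/* Allocate one more chunk and thread its blocks onto the free list. */
static int pool_grow(pool_t *p) {
    block_t *chunk = malloc(BLOCKS_PER_CHUNK * sizeof *chunk);
    if (chunk == NULL) return -1;
    for (int i = 0; i < BLOCKS_PER_CHUNK; i++) {
        chunk[i].next = p->free_list;
        p->free_list = &chunk[i];
    }
    p->chunks++;
    return 0;
}

/* Take a block, growing the pool only when nothing is free. */
static block_t *pool_alloc(pool_t *p) {
    if (p->free_list == NULL && pool_grow(p) != 0) return NULL;
    block_t *b = p->free_list;
    p->free_list = b->next;
    return b;
}

/* Return a block so a later pool_alloc can reuse it. */
static void pool_free(pool_t *p, block_t *b) {
    b->next = p->free_list;
    p->free_list = b;
}

/* Repeat the same pattern `steps` times, taking `per_step` blocks each
 * time.  If `release` is nonzero the blocks are returned after each step.
 * Returns how many chunks were malloc'ed.  Chunks are deliberately never
 * handed back to the system, since only the growth is of interest here. */
size_t simulate(int steps, int per_step, int release) {
    pool_t p = { NULL, 0 };
    block_t *held[MAX_PER_STEP];
    assert(per_step > 0 && per_step <= MAX_PER_STEP);
    for (int s = 0; s < steps; s++) {
        for (int i = 0; i < per_step; i++) {
            held[i] = pool_alloc(&p);
            if (held[i] == NULL) return p.chunks;
        }
        if (release) {
            for (int i = 0; i < per_step; i++) pool_free(&p, held[i]);
        }
    }
    return p.chunks;
}
```

In this sketch simulate(100, 3, 1) mallocs a single chunk no matter how 
many steps run, while simulate(100, 3, 0) mallocs 75 chunks, which is the 
kind of steady growth we are seeing.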
Thanks
Justin
Justin wrote:
> Hi,
>
> I am tracking down some memory issues in our code, and I am finding 
> strange memory allocations occurring within MPI_Waitsome and 
> MPI_Testsome.  In one section of our code we use MPI_Pack and 
> MPI_Unpack to combine a bunch of small messages.  We then send out the 
> packed messages using isend.  The receiving processors post irecvs.  
> To complete the communication we use both testsome and waitsome.  What 
> we are seeing is that processors start by allocating a small amount of 
> memory, but as the code marches forward in time processors will 
> allocate more memory within one of these MPI calls.  Processors 
> continue allocating larger and larger amounts of memory as time goes 
> on.  For example, early on the allocation might be a couple of KB, but 
> eventually it will get to around 1MB, and I've even seen it as high as 
> 14MB.  I predict that if I ran it further it would allocate a much 
> larger amount than 14MB.  Processors are not all allocating this 
> memory at the same time.  In other parts of the code we do not use 
> packing and we do not see this allocation behavior.  I'm guessing that 
> somewhere we are misusing either packing or some other MPI feature 
> and are causing MPI to leak.
>
> I was wondering if you could tell me why testsome/waitsome would 
> allocate memory, as that could provide a good hint as to how we are 
> misusing MPI.
>
> Currently we are using MVAPICH version 0.9.9 on Atlas at LLNL.
>
> Thanks,
> Justin
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
    
    