[mvapich-discuss] waitsome/testsome memory allocation

Justin luitjens at cs.utah.edu
Wed Oct 10 12:36:40 EDT 2007


Hi,

After looking through the MPI source code, I have come to the
conclusion that these allocations are occurring because we post the
receives later than the sends.  Many of the sends complete before their
corresponding receives are posted, so handles must be allocated for the
incoming messages when they are placed in the unexpected message queue.
Is it possible that these handles are not being freed after the matching
receives are finally posted?  Does your implementation hold on to memory
after it is allocated, on the assumption that it will be used again in
the future?
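
For reference, the ordering I am describing looks roughly like the
following sketch.  The message size, tag, and ring pattern below are
made up purely for illustration (this is not our actual code), but it
shows how the sends are driven to completion inside testsome before the
matching receives exist:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;              /* small enough to go eager */
    char *sendbuf = malloc(count);
    char *recvbuf = malloc(count);
    int peer = (rank + 1) % size;        /* made-up ring exchange */
    int from = (rank + size - 1) % size;

    MPI_Request sreq, rreq;
    MPI_Status status;
    int done = 0, index;

    /* 1. the send goes out first ... */
    MPI_Isend(sendbuf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &sreq);

    /* 2. ... and progress is driven by MPI_Testsome; any eager data
     *    that arrives here has no matching receive yet, so the library
     *    has to buffer it as an unexpected message. */
    while (!done)
        MPI_Testsome(1, &sreq, &done, &index, &status);

    /* 3. only now is the matching receive posted */
    MPI_Irecv(recvbuf, count, MPI_BYTE, from, 0, MPI_COMM_WORLD, &rreq);
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}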

I'm concerned that there might be a leak in your implementation with
this communication pattern.  If you do retain memory on the assumption
that it will be used again, memory usage should eventually reach a
high-water mark.  Unfortunately, that does not seem to be the case: our
usage keeps creeping upward.  The average grows because, occasionally, a
processor allocates a large amount of memory.

We are going to try posting the receives up front (roughly as sketched
below), but we would also like to verify that there isn't a leak.
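
The reordering we intend to try looks roughly like the sketch below.
The helper function, neighbor arrays, and buffer sizes are hypothetical
and not taken from our code; the point is only that every receive is
pre-posted before any send, so incoming data can be matched right away
instead of being queued (and allocated for) as unexpected messages:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper: exchange one packed buffer with each neighbor,
 * with all receives posted before any send. */
void exchange_preposted(char **sendbufs, char **recvbufs,
                        const int *counts, const int *neighbors,
                        int nneighbors, MPI_Comm comm)
{
    int n = 2 * nneighbors;
    MPI_Request *reqs    = malloc(n * sizeof(MPI_Request));
    MPI_Status  *stats   = malloc(n * sizeof(MPI_Status));
    int         *indices = malloc(n * sizeof(int));

    /* 1. pre-post every receive first */
    for (int i = 0; i < nneighbors; ++i)
        MPI_Irecv(recvbufs[i], counts[i], MPI_PACKED, neighbors[i],
                  0, comm, &reqs[i]);

    /* 2. then post the sends */
    for (int i = 0; i < nneighbors; ++i)
        MPI_Isend(sendbufs[i], counts[i], MPI_PACKED, neighbors[i],
                  0, comm, &reqs[nneighbors + i]);

    /* 3. drain everything with MPI_Waitsome */
    int remaining = n;
    while (remaining > 0) {
        int outcount;
        MPI_Waitsome(n, reqs, &outcount, indices, stats);
        if (outcount == MPI_UNDEFINED)
            break;                       /* no active requests left */
        remaining -= outcount;
    }

    free(reqs);
    free(stats);
    free(indices);
}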

Justin

Justin wrote:
> Hi,
>
> The relevant stack traces for these allocations are the following:
>
>
> 1. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Thread.so [0x2a95c42284]
> 2. /lib64/tls/libc.so.6 [0x2a9a7152b0]
> 3. /lib64/tls/libc.so.6(gsignal+0x3d) [0x2a9a71521d]
> 4. /lib64/tls/libc.so.6(abort+0xfe) [0x2a9a716a1e]
> 5. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 in __gnu_cxx::__verbose_terminate_handler()
> 6. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a499076]
> 7. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a4990a3]
> 8. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a4990b6]
> 9. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6(__cxa_call_unexpected+0x48) [0x2a9a498fc8]
> a. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Malloc.so(malloc+0x63) [0x2a980c92ff]
> b. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBiAllocate+0x39) [0x2a98bdce39]
> c. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBalloc+0x2b) [0x2a98bdcf8b]
> d. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_Msg_arrived+0xe3) [0x2a98bda2b3]
> e. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(viadev_incoming_eager_start+0x43) [0x2a98be8753]
> f. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(viadev_process_recv+0x2ef) [0x2a98be9b6f]
> 10. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_DeviceCheck+0xde) [0x2a98bea77e]
> 11. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPI_Testsome+0x45) [0x2a98be1b35]
>
> And
>
> 1. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Thread.so [0x2a95c42284]
> 2. /lib64/tls/libc.so.6 [0x2a9a7152b0]
> 3. /lib64/tls/libc.so.6(gsignal+0x3d) [0x2a9a71521d]
> 4. /lib64/tls/libc.so.6(abort+0xfe) [0x2a9a716a1e]
> 5. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 in __gnu_cxx::__verbose_terminate_handler()
> 6. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a499076]
> 7. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a4990a3]
> 8. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a4990b6]
> 9. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6(__cxa_call_unexpected+0x48) [0x2a9a498fc8]
> a. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Malloc.so(malloc+0x63) [0x2a980c92ff]
> b. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBiAllocate+0x39) [0x2a98bdce39]
> c. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBalloc+0x2b) [0x2a98bdcf8b]
> d. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_Msg_arrived+0xe3) [0x2a98bda2b3]
> e. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(smpi_net_lookup+0xc24) [0x2a98bd3bd4]
> f. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SMP_Check_incoming+0x2d5) [0x2a98bd4ee5]
> 10. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_DeviceCheck+0x185) [0x2a98bea825]
> 11. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPI_Testsome+0x45) [0x2a98be1b35]
>
>
> I have tried turning off the ELAN optimizations and the allocations 
> still occur.  The common element in the stack traces appears to be the 
> calls to SBalloc.  Is it possible that there is a leak in the MPI 
> library that we are running into?  When is the memory allocated in this 
> function freed?  If the same communication pattern occurs over and 
> over, what would cause this function to keep allocating memory instead 
> of reusing the memory that has already been allocated?
>
> Thanks
> Justin
>
>
> Justin wrote:
>> Hi,
>>
>> I am tracking down some memory issues in our code, and I am finding 
>> strange memory allocations occurring within MPI_Waitsome and 
>> MPI_Testsome.  In one section of our code we use MPI_Pack and 
>> MPI_Unpack to combine a bunch of small messages.  We then send out 
>> the packed messages using isend, and the receiving processors post 
>> irecvs.  To complete the communication we use both testsome and 
>> waitsome.  What we are seeing is that processors start by allocating 
>> a small amount of memory, but as the code marches forward in time 
>> they allocate more memory within one of these MPI calls, and they 
>> continue allocating larger and larger amounts as time goes on.  For 
>> example, early on an allocation might be a couple of KB, but 
>> eventually it gets to around 1 MB, and I've even seen it as high as 
>> 14 MB.  I predict that if I ran further it would allocate much more 
>> than 14 MB.  Processors are not all allocating this memory at the 
>> same time.  In other parts of the code we do not use packing, and we 
>> do not see this allocation behavior.  I'm guessing that somewhere we 
>> are misusing packing or some other MPI feature and causing MPI to 
>> leak.
>>
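>> For concreteness, the pattern is roughly the following.  This is only 
>> a sketch (the sizes, tags, and ring exchange are made up for 
>> illustration; it is not our actual code):
>>
>> #include <mpi.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Init(&argc, &argv);
>>     int rank, size;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>     int peer = (rank + 1) % size, from = (rank + size - 1) % size;
>>
>>     /* pack a few small messages into one buffer */
>>     int ints[4] = {1, 2, 3, 4};
>>     double vals[8] = {0};
>>     int s1, s2, pos = 0;
>>     MPI_Pack_size(4, MPI_INT, MPI_COMM_WORLD, &s1);
>>     MPI_Pack_size(8, MPI_DOUBLE, MPI_COMM_WORLD, &s2);
>>     char *sendbuf = malloc(s1 + s2);
>>     char *recvbuf = malloc(s1 + s2);
>>     MPI_Pack(ints, 4, MPI_INT, sendbuf, s1 + s2, &pos, MPI_COMM_WORLD);
>>     MPI_Pack(vals, 8, MPI_DOUBLE, sendbuf, s1 + s2, &pos, MPI_COMM_WORLD);
>>
>>     /* isend the packed buffer, irecv the neighbor's packed buffer */
>>     MPI_Request reqs[2];
>>     MPI_Isend(sendbuf, pos, MPI_PACKED, peer, 0, MPI_COMM_WORLD, &reqs[0]);
>>     MPI_Irecv(recvbuf, s1 + s2, MPI_PACKED, from, 0, MPI_COMM_WORLD,
>>               &reqs[1]);
>>
>>     /* complete with testsome/waitsome -- this is where we see the
>>      * allocations */
>>     int remaining = 2;
>>     while (remaining > 0) {
>>         int outcount, indices[2];
>>         MPI_Status stats[2];
>>         MPI_Waitsome(2, reqs, &outcount, indices, stats);
>>         if (outcount == MPI_UNDEFINED) break;
>>         remaining -= outcount;
>>     }
>>
>>     /* unpack the received buffer */
>>     int rints[4];
>>     double rvals[8];
>>     pos = 0;
>>     MPI_Unpack(recvbuf, s1 + s2, &pos, rints, 4, MPI_INT, MPI_COMM_WORLD);
>>     MPI_Unpack(recvbuf, s1 + s2, &pos, rvals, 8, MPI_DOUBLE,
>>                MPI_COMM_WORLD);
>>
>>     free(sendbuf);
>>     free(recvbuf);
>>     MPI_Finalize();
>>     return 0;
>> }
>>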
>> I was wondering if you could tell me why testsome/waitsome would 
>> allocate memory, as that could provide a good hint as to how we are 
>> misusing MPI.
>>
>> Currently we are using MVAPICH version 0.9.9 on Atlas at LLNL.
>>
>> Thanks,
>> Justin
>


