[mvapich-discuss] waitsome/testsome memory allocation
Justin
luitjens at cs.utah.edu
Wed Oct 10 12:36:40 EDT 2007
Hi, after looking through the MPI source code I have come to the
conclusion that these allocations occur because we post the
receives later than the sends. Many of the sends complete before their
corresponding receives are posted, so handles must be allocated
for the incoming messages before they are placed in the unexpected
message queue. Is it possible that the handles are not being freed
after the matching receives are posted? Does your implementation hold on
to memory after it is allocated, under the assumption that it will be
used again in the future?
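The situation described above can be illustrated with a toy model (plain Python, invented names, not MVAPICH internals): a message that arrives before its receive is posted forces the library to allocate a handle and park the data in the unexpected-message queue, while a pre-posted receive consumes the data directly with no extra allocation.

```python
# Toy model of an MPI unexpected-message queue (illustration only;
# class and method names are invented, not MVAPICH's actual internals).

class ToyMPI:
    def __init__(self):
        self.posted_recvs = {}   # tag -> pre-posted receive
        self.unexpected = {}     # tag -> buffered copy (allocated handle)
        self.allocations = 0     # count of unexpected-queue allocations

    def post_recv(self, tag):
        # If the message already arrived, consume the buffered copy
        # and release its handle.
        if tag in self.unexpected:
            return self.unexpected.pop(tag)
        self.posted_recvs[tag] = None
        return None

    def deliver_send(self, tag, data):
        # A send that matches a pre-posted receive needs no extra
        # allocation; otherwise a handle must be allocated and the
        # message placed in the unexpected queue.
        if tag in self.posted_recvs:
            self.posted_recvs.pop(tag)
        else:
            self.allocations += 1
            self.unexpected[tag] = data

# Receives posted late: every arriving send costs an allocation.
late = ToyMPI()
for t in range(100):
    late.deliver_send(t, b"payload")
for t in range(100):
    late.post_recv(t)

# Receives pre-posted: no unexpected-queue allocations at all.
early = ToyMPI()
for t in range(100):
    early.post_recv(t)
for t in range(100):
    early.deliver_send(t, b"payload")

print(late.allocations, early.allocations)   # 100 vs 0
```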
I'm concerned that there might be a leak in your implementation with
this communication pattern. If you do retain memory under the
assumption that it will be used again eventually, the program should
reach a high-water mark and level off. Unfortunately that does not seem
to be the case: our usage keeps climbing slowly. The average rises
because occasionally a processor allocates a large amount of memory.
We are going to try posting the receives up front, but we would also
like to verify that there isn't a leak.
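The high-water behavior I would expect from a pooling allocator can be sketched as follows (again plain Python, purely to illustrate the distinction; how MPID_SBalloc actually manages its blocks may differ): a pool that returns blocks to a free list plateaus at the peak number of simultaneously live handles, so usage that climbs without bound under a repeating pattern would instead suggest handles are never returned.

```python
# Sketch of a block pool with a free list (illustration only; not
# MVAPICH's actual MPID_SBalloc implementation).

class BlockPool:
    def __init__(self, block_size):
        self.block_size = block_size
        self.free_list = []
        self.total_blocks = 0        # total blocks ever allocated

    def alloc(self):
        if self.free_list:
            return self.free_list.pop()   # reuse: no new memory
        self.total_blocks += 1            # grow the pool
        return bytearray(self.block_size)

    def free(self, block):
        self.free_list.append(block)      # return block for reuse

pool = BlockPool(256)

# Repeating the same pattern (10 live handles per iteration) should
# plateau at 10 blocks, no matter how many iterations run.
for _ in range(1000):
    live = [pool.alloc() for _ in range(10)]
    for b in live:
        pool.free(b)

print(pool.total_blocks)   # 10: a high-water mark, not unbounded growth
```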
Justin
Justin wrote:
> Hi,
>
> The relevant stack traces on these allocations is the following:
>
>
> 1. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Thread.so [0x2a95c42284]
> 2. /lib64/tls/libc.so.6 [0x2a9a7152b0]
> 3. /lib64/tls/libc.so.6(gsignal+0x3d) [0x2a9a71521d]
> 4. /lib64/tls/libc.so.6(abort+0xfe) [0x2a9a716a1e]
> 5. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 in __gnu_cxx::__verbose_terminate_handler()
> 6. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a499076]
> 7. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a4990a3]
> 8. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a4990b6]
> 9. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6(__cxa_call_unexpected+0x48) [0x2a9a498fc8]
> a. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Malloc.so(malloc+0x63) [0x2a980c92ff]
> b. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBiAllocate+0x39) [0x2a98bdce39]
> c. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBalloc+0x2b) [0x2a98bdcf8b]
> d. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_Msg_arrived+0xe3) [0x2a98bda2b3]
> e. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(viadev_incoming_eager_start+0x43) [0x2a98be8753]
> f. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(viadev_process_recv+0x2ef) [0x2a98be9b6f]
> 10. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_DeviceCheck+0xde) [0x2a98bea77e]
> 11. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPI_Testsome+0x45) [0x2a98be1b35]
>
> And
>
> 1. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Thread.so [0x2a95c42284]
> 2. /lib64/tls/libc.so.6 [0x2a9a7152b0]
> 3. /lib64/tls/libc.so.6(gsignal+0x3d) [0x2a9a71521d]
> 4. /lib64/tls/libc.so.6(abort+0xfe) [0x2a9a716a1e]
> 5. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 in __gnu_cxx::__verbose_terminate_handler()
> 6. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a499076]
> 7. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a4990a3]
> 8. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6 [0x2a9a4990b6]
> 9. /usr/local/tools/gnu/gcc/3.4.4_RH_chaos_3_x86_64/usr/lib64/libstdc++.so.6(__cxa_call_unexpected+0x48) [0x2a9a498fc8]
> a. /g/g20/luitjens/SCIRunMemory/dbg/lib/libCore_Malloc.so(malloc+0x63) [0x2a980c92ff]
> b. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBiAllocate+0x39) [0x2a98bdce39]
> c. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SBalloc+0x2b) [0x2a98bdcf8b]
> d. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_Msg_arrived+0xe3) [0x2a98bda2b3]
> e. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(smpi_net_lookup+0xc24) [0x2a98bd3bd4]
> f. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_SMP_Check_incoming+0x2d5) [0x2a98bd4ee5]
> 10. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPID_DeviceCheck+0x185) [0x2a98bea825]
> 11. /g/g20/luitjens/mpi//lib/libmpich.so.1.0(MPI_Testsome+0x45) [0x2a98be1b35]
>
>
> I have tried turning off the ELAN optimizations, and the allocations
> still occur. The commonality between the stack traces appears to be the
> calls to MPID_SBalloc. Is it possible that there is a leak in the MPI
> library that we are running into? When is the memory allocated by this
> function freed? If the same communication pattern occurs over
> and over, what would cause this function to keep allocating new memory
> instead of reusing the memory it has already allocated?
>
> Thanks
> Justin
>
>
> Justin wrote:
>> Hi,
>>
>> I am tracking down some memory issues in our code, and I am finding
>> strange memory allocations occurring within MPI_Waitsome and
>> MPI_Testsome. In one section of our code we use MPI_Pack and
>> MPI_Unpack to combine a bunch of small messages. We then send out
>> the packed messages using isend, and the receiving processors post
>> irecvs. To complete the communication we use both testsome and
>> waitsome. What we are seeing is that processors start by allocating a
>> small amount of memory, but as the code marches forward in time
>> they allocate more and more memory within one of these MPI calls.
>> For example, early on an allocation might be a couple of KB, but
>> eventually it gets to around 1 MB, and I've even seen it as high as
>> 14 MB. I expect that if I ran the code further it would allocate much
>> more than 14 MB. The processors do not all allocate this memory at
>> the same time. In other parts of the code we do not use packing, and
>> we do not see this allocation behavior. I'm guessing that somewhere
>> we are misusing packing or some other MPI feature and causing MPI to
>> leak.
>>
>> I was wondering if you could tell me why testsome/waitsome would
>> allocate memory, as that could provide a good hint as to how we are
>> misusing MPI.
>>
>> Currently we are using MVAPICH version 0.9.9 on Atlas at LLNL.
>>
>> Thanks,
>> Justin
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>