[mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage?

Xie Min xmxmxie at gmail.com
Fri Mar 20 09:08:40 EDT 2009


Thanks for the new patch.

We ran several tests today, but the test system was busy, so we did not
have enough nodes or time for thorough testing.
We have not run HPCC to completion yet.

For 512 tasks, this patch seems to work, but we only ran it for 30
minutes, during which HPL ran normally.

For 1024 tasks, in one test HPL deadlocked after running for about
5 minutes (with the old patch, HPL deadlocked at the start). However,
our test system is not stable today, so the deadlock may have been
caused by a network failure. We need to find enough resources and time
to do more tests.

2009/3/19 Matthew Koop <koop at cse.ohio-state.edu>:
> Xie,
>
> Can you try the following attached patch instead of the other patch? We
> found some places where a deadlock may have been allowed to occur.
>
> Thank you,
> Matt
>
> On Sat, 14 Mar 2009, Xie Min wrote:
>
>> Today we ran some more tests.
>>
>> For 128-task HPCC (16 nodes * 8), we could run the whole test
>> successfully and get the final result.
>>
>> But for 512-task HPCC (64 nodes * 8), HPL froze too when
>> MV2_USE_LAZY_MEM_UNREGISTER=1.
>>
>> I have attached an input file for 512-task HPCC (about 1.6GB for each
>> task); maybe you can try it on your systems to see whether it produces
>> the same problem.
>>
>> Thanks.
>>
>> 2009/3/12 Matthew Koop <koop at cse.ohio-state.edu>:
>> > Xie,
>> >
>> > Thanks for sending this information along. We've spent some time
>> > investigating the issue and came up with a patch that will hopefully
>> > resolve your issue. I've attached it to this email and it should be
>> > applied at the base directory.
>> >
>> > Please let us know if this helps the problem,
>> >
>> > Matt
>> >
>> > On Mon, 9 Feb 2009, Xie Min wrote:
>> >
>> >> We were using HPCC 1.0.0, but we just tried HPCC 1.3.1 and it seems
>> >> to have the same problem.
>> >>
>> >> We have attached two hpccinf.txt files for 64 HPCC tasks:
>> >> hpccinf.txt.13 gives a "RES" of about 1.3GB, while hpccinf.txt.16
>> >> gives a "RES" of about 1.6/1.7GB. Would you please try them on your
>> >> systems (with MV2_USE_LAZY_MEM_UNREGISTER=1)? Thanks.
>> >>
>> >> BTW, the OFED version we use is 1.3.1, physical memory on each node
>> >> is 16GB, and we use 8 nodes for 64 tasks.
>> >>
>> >>
>> >>
>> >> 2009/2/7 Matthew Koop <koop at cse.ohio-state.edu>:
>> >> >
>> >> > Thanks for the additional information. I've tried here with HPCC 1.3.1 and
>> >> > I haven't been able to see any difference in the 'RES' or 'VIRT' memory
>> >> > while running.
>> >> >
>> >> > Would it be possible to send me your hpccinf.txt file so I can try to
>> >> > reproduce the problem more closely? We also have AS5 with kernel
>> >> > 2.6.18.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Matt
>> >> >
>> >> > On Thu, 5 Feb 2009, Xie Min wrote:
>> >> >
>> >> >> We use Red Hat AS5 with kernel 2.6.18 and Lustre 1.6.6, and we have
>> >> >> not modified the kernel source.
>> >> >>
>> >> >> We tested HPCC on two clusters:
>> >> >> In one cluster, each node is booted using Boot over IB and has no
>> >> >> hard disk, so there is NO swap space. We ran 64 HPCC tasks on 8 nodes
>> >> >> (so each CPU core in a node runs one HPCC task). When each HPCC task
>> >> >> uses 1.2/1.3G of memory, it is killed by the OS with an "Out of
>> >> >> memory" error. But with MV2_USE_LAZY_MEM_UNREGISTER=0, each task can
>> >> >> use 1.7G of memory and run successfully.
>> >> >>
>> >> >> In the other cluster, each node has a hard disk, boots from local
>> >> >> disk, and HAS swap space. We ran 64 HPCC tasks on 8 nodes there too.
>> >> >> With each HPCC task using 1.3G of memory, we used "top" to watch the
>> >> >> memory usage and found that swap starts being used after HPCC has
>> >> >> been running for a while; the node then becomes very slow and cannot
>> >> >> respond to keyboard input. But with MV2_USE_LAZY_MEM_UNREGISTER=0,
>> >> >> each task can be scaled to 1.7G of memory and run successfully.
>> >> >>
>> >> >> I also tried another set of mvapich2 parameters:
>> >> >> MV2_USE_LAZY_MEM_UNREGISTER=1 and MV2_NDREG_ENTRIES=8. In this
>> >> >> configuration, HPCC is still killed by the OS with an "Out of memory"
>> >> >> error when the memory scale of each task is set to 1.3GB.
>> >> >>
>> >> >> 2009/2/5 Matthew Koop <koop at cse.ohio-state.edu>:
>> >> >> > Hi,
>> >> >> >
>> >> >> > What OS/distro are you running? Have you made any changes from the
>> >> >> > base, such as page size?
>> >> >> >
>> >> >> > I'm taking a look at this issue on our machine as well, although I'm not
>> >> >> > seeing the memory change that you reported.
>> >> >> >
>> >> >> > Matt
>> >> >> >
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >>
>> >
>>
>
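
As a rough sanity check on the figures quoted in the thread above (64
tasks on 8 nodes, 16GB of physical memory per node, per-task "RES" of
about 1.3GB or 1.7GB), the per-node application footprint can be
sketched as below. This is only illustrative arithmetic: the function
name is made up for this example, and it does not model whatever extra
memory stays pinned when MV2_USE_LAZY_MEM_UNREGISTER=1, nor any kernel
or Lustre overhead.

```python
# Figures taken from the thread: 64 tasks on 8 nodes, 16GB RAM per node.
NODE_RAM_GB = 16.0
TASKS_PER_NODE = 64 // 8  # one HPCC task per CPU core


def per_node_footprint_gb(task_res_gb, tasks_per_node=TASKS_PER_NODE):
    """Memory used on one node by the HPCC tasks alone (hypothetical helper)."""
    return task_res_gb * tasks_per_node


# Per-task "RES" values reported in the thread.
for task_res_gb in (1.3, 1.7):
    used = per_node_footprint_gb(task_res_gb)
    headroom = NODE_RAM_GB - used
    print(f"{task_res_gb:.1f}GB/task: {used:.1f}GB used, "
          f"{headroom:.1f}GB headroom per node")
```

By this count the 1.3GB case leaves several GB per node that the HPCC
working set itself does not account for, which suggests the
"Out of memory" kills come from memory held outside the application's
own data when MV2_USE_LAZY_MEM_UNREGISTER=1.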

