[mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage?

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Mar 23 20:23:22 EDT 2009


> Thanks for the new patch.
>
> We did several tests today, but the testing system is busy, so we
> didn't have enough node resources or time to test thoroughly.
> We haven't run HPCC to completion yet.
>
> For 512 tasks, the patch seems to work, but we only ran it for 30
> minutes, during which HPL ran normally.

Good to know about this.

> For 1024 tasks, in one test HPL deadlocked after running for about
> 5 minutes (with the old patch, HPL would deadlock right at the
> start), but our testing system is not stable today, so the deadlock
> may have been caused by a network failure. We need to find enough
> resources and time to do more tests.

Once you have a stable system, let us know how it works for 1024 tasks.

Thanks,

DK

> 2009/3/19 Matthew Koop <koop at cse.ohio-state.edu>:
> > Xie,
> > Can you try the attached patch instead of the previous one? We
> > found some places where a deadlock could occur.
> >
> > Thank you,
> > Matt
> >
> > On Sat, 14 Mar 2009, Xie Min wrote:
> >
> >> Today we ran some more tests.
> >>
> >> For 128 tasks HPCC (16 nodes * 8), we ran the whole test
> >> successfully and got a final result.
> >>
> >> But for 512 tasks HPCC (64 nodes * 8), HPL freezes as well when
> >> MV2_USE_LAZY_MEM_UNREGISTER=1.
> >>
> >> I have attached an input file for 512 HPCC tasks (about 1.6GB per
> >> task); maybe you can try it on your systems to see whether it
> >> reproduces the same problem.
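> >>
> >> As a rough check of the scale: HPL's dense matrix takes about
> >> 8*N^2 bytes in total (double precision), i.e. 8*N^2/512 bytes per
> >> task here, so ~1.6GB per task corresponds to a problem size of
> >> roughly N = sqrt(512 * 1.6e9 / 8) ~ 320000 (ignoring HPL's
> >> workspace overhead).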
> >>
> >> Thanks.
> >>
> >> 2009/3/12 Matthew Koop <koop at cse.ohio-state.edu>:
> >> > Xie,
> >> >
> >> > Thanks for sending this information along. We've spent some time
> >> > investigating the issue and came up with a patch that will hopefully
> >> > resolve your issue. I've attached it to this email and it should be
> >> > applied at the base directory.
> >> >
> >> > Please let us know if this helps the problem,
> >> >
> >> > Matt
> >> >
> >> > On Mon, 9 Feb 2009, Xie Min wrote:
> >> >
> >> >> The HPCC version we used is 1.0.0, but we just tried HPCC 1.3.1
> >> >> and it seems to have the same problem.
> >> >>
> >> >> We have attached two hpccinf.txt files for 64 HPCC tasks:
> >> >> hpccinf.txt.13 gives a "RES" of about 1.3GB, while hpccinf.txt.16
> >> >> gives a "RES" of about 1.6/1.7GB. Would you please try them on your
> >> >> systems (with MV2_USE_LAZY_MEM_UNREGISTER=1)? Thanks.
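> >> >>
> >> >> For reference, we just pass the variable on the mpirun_rsh
> >> >> launch line, something like the following (hostfile path
> >> >> illustrative):
> >> >>
> >> >>   mpirun_rsh -np 64 -hostfile ./hosts \
> >> >>       MV2_USE_LAZY_MEM_UNREGISTER=1 ./hpcc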
> >> >>
> >> >> BTW, the OFED version we used is 1.3.1; physical memory on each
> >> >> node is 16GB, and we use 8 nodes for the 64 tasks.
> >> >>
> >> >>
> >> >>
> >> >> 2009/2/7 Matthew Koop <koop at cse.ohio-state.edu>:
> >> >> >
> >> >> > Thanks for the additional information. I've tried here with HPCC 1.3.1 and
> >> >> > I haven't been able to see any difference in the 'RES' or 'VIRT' memory
> >> >> > while running.
> >> >> >
> >> >> > Would it be possible to send me your hpccinf.txt file so I can try
> >> >> > to reproduce the problem more closely? We have AS5 with kernel
> >> >> > 2.6.18 here as well.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Matt
> >> >> >
> >> >> > On Thu, 5 Feb 2009, Xie Min wrote:
> >> >> >
> >> >> >> We use Red Hat AS5; the kernel is 2.6.18 with Lustre 1.6.6, and
> >> >> >> we haven't modified the kernel source.
> >> >> >>
> >> >> >> We test HPCC on two clusters:
> >> >> >> In one cluster, each node boots over IB and has no hard disk, so
> >> >> >> there is NO swap space. We run 64 HPCC tasks on 8 nodes (so each
> >> >> >> CPU core runs one HPCC task); when each HPCC task uses 1.2/1.3GB
> >> >> >> of memory, it is killed by the OS with an "Out of memory" error.
> >> >> >> But when MV2_USE_LAZY_MEM_UNREGISTER=0, each task can use 1.7GB
> >> >> >> of memory and run successfully.
> >> >> >>
> >> >> >> In the other cluster, each node has a hard disk and boots from
> >> >> >> local disk, so it HAS swap space. We run 64 HPCC tasks on 8 nodes
> >> >> >> there too. When each HPCC task uses 1.3GB of memory, "top" shows
> >> >> >> that swap starts being used after HPCC has run for a while, and
> >> >> >> the node then becomes very slow and stops responding to keyboard
> >> >> >> input. But when MV2_USE_LAZY_MEM_UNREGISTER=0, each task can be
> >> >> >> scaled to 1.7GB of memory and runs successfully.
> >> >> >>
> >> >> >> I also tried a different combination of mvapich2 parameters:
> >> >> >> MV2_USE_LAZY_MEM_UNREGISTER=1 with MV2_NDREG_ENTRIES=8. In this
> >> >> >> configuration HPCC is still killed by the OS with an "Out of
> >> >> >> memory" error when each task's memory is set to 1.3GB.
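> >> >> >>
> >> >> >> Our understanding of the mechanism, which the sketch below tries
> >> >> >> to illustrate: with lazy unregister the library keeps
> >> >> >> RDMA-registered (pinned) buffers cached instead of deregistering
> >> >> >> them after each transfer, and pinned pages can neither be swapped
> >> >> >> nor reclaimed, so with 8 tasks per 16GB node the cached pinned
> >> >> >> memory on top of ~1.3GB per task can push a node past physical
> >> >> >> memory. This is only an illustrative libibverbs sketch, not
> >> >> >> MVAPICH2's actual code; the CACHE_MAX cap mirrors what we assume
> >> >> >> MV2_NDREG_ENTRIES bounds.
> >> >> >>
> >> >> >> #include <infiniband/verbs.h>
> >> >> >> #include <stddef.h>
> >> >> >>
> >> >> >> #define CACHE_MAX 8  /* cf. MV2_NDREG_ENTRIES (assumed meaning) */
> >> >> >>
> >> >> >> struct reg_entry { void *addr; size_t len; struct ibv_mr *mr; };
> >> >> >> static struct reg_entry cache[CACHE_MAX];
> >> >> >> static int ncached;
> >> >> >>
> >> >> >> /* Return a registration for buf, pinning it on a cache miss.
> >> >> >>  * Lazy scheme: entries stay pinned after a transfer finishes. */
> >> >> >> struct ibv_mr *reg_cache_lookup(struct ibv_pd *pd, void *buf,
> >> >> >>                                 size_t len)
> >> >> >> {
> >> >> >>     for (int i = 0; i < ncached; i++)
> >> >> >>         if (cache[i].addr == buf && cache[i].len >= len)
> >> >> >>             return cache[i].mr;  /* hit: pages already pinned */
> >> >> >>
> >> >> >>     if (ncached == CACHE_MAX) {  /* full: evict oldest entry */
> >> >> >>         ibv_dereg_mr(cache[0].mr);  /* pages unpinned only now */
> >> >> >>         for (int i = 1; i < CACHE_MAX; i++)
> >> >> >>             cache[i - 1] = cache[i];
> >> >> >>         ncached--;
> >> >> >>     }
> >> >> >>
> >> >> >>     struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
> >> >> >>                                    IBV_ACCESS_LOCAL_WRITE |
> >> >> >>                                    IBV_ACCESS_REMOTE_READ |
> >> >> >>                                    IBV_ACCESS_REMOTE_WRITE);
> >> >> >>     if (!mr)
> >> >> >>         return NULL;  /* pinning failed, e.g. memlock limit */
> >> >> >>     cache[ncached++] = (struct reg_entry){ buf, len, mr };
> >> >> >>     return mr;
> >> >> >> }
> >> >> >>
> >> >> >> With MV2_USE_LAZY_MEM_UNREGISTER=0 we assume the eager path
> >> >> >> instead calls ibv_dereg_mr as soon as each transfer completes, so
> >> >> >> pinned memory does not accumulate, which would match why RES
> >> >> >> stays bounded in that mode for us.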
> >> >> >>
> >> >> >> 2009/2/5 Matthew Koop <koop at cse.ohio-state.edu>:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > What OS/distro are you running? Have you made any changes from
> >> >> >> > the base, such as page size?
> >> >> >> >
> >> >> >> > I'm taking a look at this issue on our machine as well, although I'm not
> >> >> >> > seeing the memory change that you reported.
> >> >> >> >
> >> >> >> > Matt
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >
> >>
> >
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


