[mvapich-discuss] Node crashes when all memory is used

Gil Bloch gil at mellanox.co.il
Mon Jun 19 12:07:00 EDT 2006


Jimmy Tang wrote:
> Hi Christopher,
>
> On 6/19/06, Christopher Rowley <crowl055 at uottawa.ca> wrote:
>> I'm running a cluster of Opterons with Fedora Core 5. We have Topspin
>> HCAs and Topspin 120 switches. We're using MVAPICH.gen2 to run a
>> computational chemistry program called VASP. The memory requirements
>> are extremely high (60 GB), and occasionally exceed what is available
>> on the nodes we're running on. When this happens, the program is
>> killed, but in the process, the first node on the list of hosts will
>> crash (it remains pingable, but with no connectivity or keyboard
>> response). We don't see this behavior with vanilla MPICH 1.2.7. Is
>> there a known issue with exceeding the total available memory with
>> MVAPICH?
>
> Out of curiosity, which compiler are you using? We had some similar
> problems with a lattice QCD code (though it doesn't use as much memory
> as VASP would in most cases), where if we turned off the
> "LAZY_MEMORY_DEREGISTER" option in MVAPICH or if we turned off -O2 or
> higher optimisations in our compiler (PathScale), everything seemed to
> work okay again.
In MVAPICH, there is a way to limit the registration cache size, so
you do not have to turn lazy deregistration off entirely.
I'd recommend limiting the registration cache size to some portion of
the total physical memory on your specific machine (the default is
unlimited).
Look for VIADEV_DREG_CACHE_LIMIT to limit the registration cache size.
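
As a rough illustration of picking "some portion of the total physical
memory", here is a minimal Python sketch that turns a fraction of RAM
into a number of pages. It assumes that VIADEV_DREG_CACHE_LIMIT is
expressed in pages, which you should confirm in the MVAPICH user guide
for your version. The function name and the 25% default are
hypothetical, chosen only for the example.

```python
import os


def dreg_cache_limit_pages(fraction=0.25, total_bytes=None, page_size=None):
    """Return the number of pages covering `fraction` of physical memory.

    ASSUMPTION: VIADEV_DREG_CACHE_LIMIT counts pages. Check the MVAPICH
    user guide for your release before relying on this unit.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    if page_size is None:
        # Page size of the running system, in bytes.
        page_size = os.sysconf("SC_PAGE_SIZE")
    if total_bytes is None:
        # Total physical memory = number of physical pages * page size.
        total_bytes = os.sysconf("SC_PHYS_PAGES") * page_size
    return int(total_bytes * fraction) // page_size


if __name__ == "__main__":
    # Prints a NAME=value pair for the node this script runs on.
    print("VIADEV_DREG_CACHE_LIMIT=%d" % dreg_cache_limit_pages())
```

Run it on a compute node and pass the resulting value to your job in
whatever way your MVAPICH launcher accepts runtime parameters. A
smaller fraction leaves more memory for the application, at the cost
of more frequent re-registration.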
>
> I don't know if that will help, but since the symptoms that we saw are
> similar to what you are seeing, it might.
>
> Jimmy.
>


