[mvapich-discuss] Node crashes when all memory is used

Matthew Koop koop at cse.ohio-state.edu
Mon Jun 19 11:22:16 EDT 2006


Christopher,

Does this problem occur at startup or later in program execution? Also,
I'm assuming you are using 0.9.7? If not, I suggest upgrading to get the
latest performance features we have available.

I'll echo Jimmy's suggestion to remove LAZY_MEM_UNREGISTER from the CFLAGS
in make.mvapich.gen2 and rebuild. This may have a slight performance impact
on your program execution, but it should eliminate any memory capacity
issues.
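
For anyone who wants to script that change, here is a minimal sketch in
Python. It assumes the define appears in the CFLAGS as the token
-DLAZY_MEM_UNREGISTER; check your copy of make.mvapich.gen2 first, since
the exact spelling and surrounding flags may differ. The function name
strip_flag and the sample flags in the comment are hypothetical.

```python
import re
import sys

# Assumed spelling of the define inside CFLAGS; verify against your script.
FLAG = "-DLAZY_MEM_UNREGISTER"


def strip_flag(script_text, flag=FLAG):
    """Return script_text with every standalone occurrence of flag removed.

    A longer token that merely starts with the flag (for example one with
    an extra suffix) is left untouched.
    """
    pattern = r"(?<![\w-])" + re.escape(flag) + r"(?!\w)[ \t]*"
    return re.sub(pattern, "", script_text)


if __name__ == "__main__":
    # Usage: python strip_flag.py make.mvapich.gen2 > make.mvapich.gen2.new
    # e.g. 'CFLAGS="-O2 -DLAZY_MEM_UNREGISTER"' becomes 'CFLAGS="-O2 "'
    with open(sys.argv[1]) as f:
        sys.stdout.write(strip_flag(f.read()))
```

The sketch writes the result to standard output rather than editing the
file in place, so the original script stays available for comparison.
Because CFLAGS only take effect at compile time, MVAPICH has to be
rebuilt with the modified script for the change to apply.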

Using MPD to start your jobs will also clean up processes a bit more
gracefully than mpirun_rsh after a job is killed. More information about
MPD support is in the user manual:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich_user_guide.pdf

Let us know if you have any other problems or if this solves your issues.

Thanks,

Matthew Koop
-
Network-Based Computing Lab
Ohio State University


On Sun, 18 Jun 2006, Christopher Rowley wrote:

> Hi,
>
> I'm running a cluster of Opterons with Fedora Core 5. We have Topspin
> HCAs and Topspin 120 switches. We're using MVAPICH.gen2 to run a
> computational chemistry program called VASP. The memory requirements
> are extremely high (60 GB), and occasionally exceed what is available on
> the nodes we're running on. When this happens, the program is killed, but
> in the process, the first node on the list of hosts will crash (it
> remains pingable, but with no connectivity or keyboard response). We
> don't see this behavior with vanilla MPICH 1.2.7. Is there a known issue
> with exceeding the total available memory with MVAPICH?
>
> Thanks,
> Christopher Rowley
> Department of Chemistry
> University of Ottawa
>


