[mvapich-discuss] Re: [Rocks-Discuss] mvapich locked memory limit and queuing system

Jeff Squyres jsquyres at cisco.com
Thu Oct 11 08:14:07 EDT 2007


I'm afraid I don't know much about SGE, but this issue is common to  
several resource managers (and any application that uses the OFED  
verbs stack / requires locked memory -- not just your favorite MPI  
implementation).  The issue is twofold:

1. Ensure that the resource manager daemons on each cluster node
start with unlimited locked memory.  Note that changes put in
/etc/security/limits.conf to set unlimited locked memory are a PAM
thing; they affect user logins, for example.  Hence, such changes do
not affect jobs started by init.d scripts.  So you need to manually
ensure that your daemons are started with unlimited locked memory.
This typically entails setting the limit *before* su'ing over to
start the resource manager in the relevant init.d script, but the
exact details will vary.
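
As a sketch, the relevant init.d fragment usually looks something like
this (the user name and daemon path below are made-up placeholders,
not from any particular resource manager):

```shell
# Raise the locked-memory limit for this shell and its children,
# *then* switch users and start the daemon so that it inherits the
# unlimited limit.  (rmuser and /opt/rm/sbin/rm_execd are
# hypothetical placeholders -- substitute your RM's actual values.)
ulimit -l unlimited
su - rmuser -c /opt/rm/sbin/rm_execd
```

The ordering matters: the ulimit call must happen in the shell that
eventually exec's the daemon, since limits are inherited across su
and exec.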

2. As already mentioned, many (most/all?) resource managers
rightfully want to control the resources that are allocated to a
job.  This may include locked memory.  If so, ensure that your
resource manager is setting unlimited locked memory for your jobs.
The details of how to do this will obviously be resource
manager-specific -- consult their docs.

Note that #2 will be ineffectual until #1 has been completed and you  
have restarted all your resource manager daemons across the cluster  
with unlimited locked memory limits.  Specifically: a user process  
can *decrease* the locked memory limit, but it cannot *increase* it  
above the limit from which it was started.  Hence, if your RM daemons
start with the default 32 kB locked memory limit, they are unable to
increase it above 32 kB even if you have configured the RM to run
jobs with unlimited locked memory.
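
You can see this one-way behavior in any shell; here is a quick
sketch (nothing RM-specific about it):

```shell
#!/bin/sh
# A process can lower its locked-memory limit, but cannot raise it
# back above the value it inherited (unless it is privileged).
bash -c '
  ulimit -l 64             # lower both soft and hard limits to 64 kB
  echo "limit is now: `ulimit -l`"
  if ulimit -H -l unlimited 2>/dev/null; then
    echo "raised the limit (only possible with privilege, e.g. root)"
  else
    echo "cannot raise the limit above 64 kB"
  fi
'
```

The inner bash plays the role of a job launched by a daemon: once the
64 kB ceiling is inherited, an unprivileged process is stuck with it.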

I usually test my entire cluster by running this trivial script on  
every node via the resource manager:

-----
#!/bin/sh
# Print the hostname if this node's locked memory limit is not unlimited.
foo=`ulimit -l`
if test "$foo" != "unlimited"; then
     hostname
fi
-----

Any hostname that gets printed identifies a node whose locked memory
limit is not set properly.
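
For SGE specifically, one way to fan the script out might be along
these lines (assuming the script is saved as check_memlock.sh; adjust
the qsub options for your site):

```shell
# Submit one copy of the check to each execution host known to SGE.
# qconf -sel lists the execution hosts; -l hostname= pins the job to
# a specific node; -j y merges stderr into stdout.
for node in `qconf -sel`; do
    qsub -l hostname=$node -j y -o check_memlock.$node.out check_memlock.sh
done
```

Once the jobs finish, any non-empty output file names a misconfigured
node.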

Hope this helps.



On Oct 10, 2007, at 4:22 PM, John Leidel wrote:

> Noam, I've also seen this happen with SGE 6.0 [I suggest also posting
> your question to the SGE users list].  It has to do with the SGE_EXECD
> arbitrarily setting its memlimits very low.  As such, all child
> processes retain the low ceiling.  Check the SGE startup scripts to
> see if the execd startup section is setting the memlimits.
>
> cheers
> john
>
> On Wed, 2007-10-10 at 16:07 -0400, Noam Bernstein wrote:
>> Has anyone noticed a conflict between OFED/mvapich's desire for a
>> large locked memory limit and the queuing system?  I've upped the
>> memlock limit in /etc/security/limits.conf (and rebooted, so the
>> queuing system daemons should be seeing the larger limit).  This
>> allows me to run interactively.  However, jobs submitted through
>> our queuing system (SGE 6.0) still see the low memlock limit
>> (32 kB), and can't increase the limit.  I found a brief discussion
>> of a similar problem in torque (this thread:
>> http://www.beowulf.org/archive/2006-May/015559.html
>> ), but no apparent solution.
>>
>> Has anyone seen this problem, and preferably discovered a solution?
>>
>> 											thanks,
>> 											Noam
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


-- 
Jeff Squyres
Cisco Systems


