[OOD-users] no environment set for HPC desktop -- job fails

Rodgers, Morgan E. mrodgers at osc.edu
Tue Dec 11 14:52:43 EST 2018


Folks,

I wanted to call out a feature in the forthcoming 1.4 release of OOD: resource manager client executable overrides. The cluster configuration YAML will support a "bin_overrides" option in the job configuration that lets you point OOD at a resource manager executable, such as sbatch or qsub, located somewhere other than the default directory. We hope this ability will solve a few problems for the community. First, we have heard that at least one site wanted to use a client program that was not in the default client executable directory. Second, it gives sites the ability to wrap the standard submission programs with custom logic, such as submit filters or environment-setting wrappers.

This second ability is an advanced usage: for most of our adapters, OOD expects to be able to capture return codes and parse STDOUT, so as long as a wrapper propagates return codes correctly and leaves the STDOUT formatting unchanged, almost anything should be possible. At this time I do not have any examples of this usage, but generating some is on my todo list.

https://osc.github.io/ood-documentation/develop/installation/resource-manager/slurm.html
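
In the meantime, a minimal sketch of what an environment-setting wrapper might look like (the sourced profile script and the sbatch path are assumptions; adapt for your site):

    #!/bin/bash
    # Hypothetical sbatch wrapper: set up the environment, then exec the
    # real sbatch so its exit code and STDOUT (which OOD parses) pass
    # through unchanged.
    source /etc/profile.d/site-env.sh   # assumed site environment setup
    exec /usr/bin/sbatch "$@"

The cluster configuration YAML would then point the "bin_overrides" entry for sbatch at this script.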

- Morgan

________________________________
From: OOD-users <ood-users-bounces+mrodgers=osc.edu at lists.osc.edu> on behalf of Michael Coleman via OOD-users <ood-users at lists.osc.edu>
Sent: Monday, December 10, 2018 8:04:34 PM
To: Jeremy Nicklas; John-Paul Robinson; User support mailing list for Open OnDemand
Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails


Hi Jeremy,



Thanks for implementing the SLURM support.  This minor setup issue notwithstanding, it's working great for us!



Regarding the particulars of exactly how the environment gets set up on the compute nodes, I suspect this varies considerably from site to site.  And many sites, definitely including ours, cannot change the already established behavior without considerable distress on the users' parts.



There are some different ideas in this ticket



    https://github.com/OSC/ood_core/issues/109



For us, the best solution would probably be a way to shim in a wrapper like



    env -i HOME=~ bash -l -c 'sbatch yada yada...'



so that we can make job startup under OOD act as similarly as possible to job startup without OOD.
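
A sketch of how that shim might look as a standalone wrapper script (the quoting details are mine and untested):

    #!/bin/bash
    # Hypothetical shim: discard the inherited (web-node) environment and
    # submit from a fresh login shell, so jobs submitted through OOD start
    # as much like jobs submitted from an ordinary login session as possible.
    exec env -i HOME="$HOME" bash -l -c 'exec sbatch "$@"' sbatch "$@"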



Lacking that, or perhaps in addition, it'd also be nice to have a fix for script.job_environment parsing so that the 'ALL' (and 'NONE') keywords are supported, rather than hacking with the 'native' feature.



Anyway, with a bit of hacking, we're up and going here.



Mike





From: Jeremy Nicklas <jeremywnicklas at gmail.com>
Sent: Saturday, December 8, 2018 03:51 PM
To: John-Paul Robinson <jprorama at gmail.com>; User support mailing list for Open OnDemand <ood-users at lists.osc.edu>
Cc: Michael Coleman <mcolema5 at uoregon.edu>
Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails



Hi all,



Original developer of the Slurm adapter in Open OnDemand here. First, I want to clear up any confusion about environment variables. Behind the scenes, when OnDemand calls the `sbatch` command to submit a job, it actually:



- sets the environment variable `SBATCH_EXPORT=NONE`

- adds the command line argument `--parsable` for easier parsing of the job id



The relevant snippet of code can be seen here:



https://github.com/OSC/ood_core/blob/866db98e3b3ba574a5799a3af84d2b5ffc010ff8/lib/ood_core/job/adapters/slurm.rb#L128-L138
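
In shell terms, the submission roughly amounts to the following (a simplified sketch of the linked Ruby with a placeholder script name, not the literal code):

    # SBATCH_EXPORT=NONE keeps the web node's environment out of the job;
    # --parsable makes sbatch print just "jobid[;cluster]" on STDOUT,
    # which the adapter captures as the job id.
    SBATCH_EXPORT=NONE sbatch --parsable my_job_script.sh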



I purposely set `SBATCH_EXPORT=NONE` for a few reasons:



- as these are Passenger apps running after a `sudo` user switch, the environment is already drastically different from the user's login-shell environment when they call `sbatch`

- the library path (`LD_LIBRARY_PATH`) and binary path (`PATH`) could differ between the web node and the cluster node

- if you use `SBATCH_EXPORT=NONE`, the Slurm documentation states:

  > Slurm will then implicitly attempt to load the user's environment on the node where the script is being executed



  which is what we would want.



I do not recommend setting this environment variable to `ALL` for the reasons listed above. It may cause more headaches down the road.
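
If you want to convince yourself that `NONE` still gives jobs a sane environment, a quick probe along these lines (the `--wrap` command is just for illustration) should show the login environment being recreated on the compute node:

    # With SBATCH_EXPORT=NONE, Slurm rebuilds the user's login environment
    # on the execution node, so HOME and friends are set even though
    # nothing was exported from the submission side.
    SBATCH_EXPORT=NONE sbatch --wrap 'echo "HOME=$HOME USER=$USER"'
    # then check the echoed values in the resulting slurm-<jobid>.out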



Now I am not entirely sure why you are seeing this issue after updating Slurm, but taking a look at the NEWS file for Slurm, one particular bugfix stands out:



https://github.com/SchedMD/slurm/blob/d128eb21583fe80b808ce64667cd89d209cd62ab/NEWS#L532



That particular commit in 17.11.11 fixes an issue in `src/common/env.c` which was introduced in 17.11.10. The exact commit is here:



https://github.com/SchedMD/slurm/commit/72b2355ca8d6f4381b4e417f76649712881f45b7



I am not sure if this is what you are experiencing, but could you test OnDemand against Slurm 17.11.11?



- Jeremy Nicklas



On Sat, Dec 8, 2018 at 5:33 PM John-Paul Robinson via OOD-users <ood-users at lists.osc.edu> wrote:

Ok.



My experience has been the behavior of the second export below, not the third. That is, specify a list of variables and you get only those specific vars; don't specify any and you get the ALL behavior.



What's odd is that desktop jobs did work before and now don't, suggesting Slurm has changed with respect to the OOD assumptions.



That leaves the question: where are those OOD assumptions made, and can we just add ALL to the front of the list?



Is that the conclusion of the GitHub issue?

On Dec 8, 2018, at 1:54 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:

The key change is what happens if you use a flag like



    --export=MYVAR1,MYVAR2



The behavior mentioned in the man page (see https://github.com/OSC/ood_core/issues/109) says that no other environment variables will be passed.  In other words, this is equivalent to



    --export=NONE,MYVAR1,MYVAR2



The recent fix actually makes sbatch behave this way.  Previously, the implementation of this case was incorrectly equivalent to



    --export=ALL,MYVAR1,MYVAR2



And just to make us all completely insane, it looks like they're changing this further in a newer release (that we don't yet use at our site).  See the new man page here:



    https://slurm.schedmd.com/sbatch.html
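
One hypothetical way to check which behavior a given sbatch build exhibits (the variable names are made up):

    # Probe: does --export=<list> act like NONE,<list> or ALL,<list>?
    export CANARY=leaked
    sbatch --export=MYVAR=42 --wrap 'echo "CANARY=$CANARY MYVAR=$MYVAR"'
    # Fixed builds (NONE semantics): CANARY is empty in slurm-<jobid>.out.
    # Older buggy builds (ALL semantics): CANARY=leaked shows up.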



Mike

Michael Coleman (mcolema5 at uoregon.edu), Computational Scientist

Research Advanced Computing Services

6235 University of Oregon

Eugene, OR 97403



________________________________

From: John-Paul Robinson <jprorama at gmail.com>
Sent: Saturday, December 8, 2018 10:54
To: Michael Coleman
Cc: User support mailing list for Open OnDemand
Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails



I gotta say, I am trying to read it very very carefully but don’t seem to be able to parse out the subtlety. 😐



Not sure how to misread this; emphasis all theirs:





--export=<environment variables [ALL] | NONE>

Identify which environment variables from the submission environment are propagated to the launched application. By default, all are propagated. Multiple environment variable names should be comma separated.





On Dec 7, 2018, at 10:24 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:

If you read the man page very, very carefully, I think the new behavior actually matches what it always said.  The old behavior didn't actually match.  But I definitely agree that it's a pretty significant change.

Michael Coleman (mcolema5 at uoregon.edu), Computational Scientist
Research Advanced Computing Services
6235 University of Oregon
Eugene, OR 97403


________________________________________
From: John-Paul Robinson <jprorama at gmail.com>
Sent: Friday, December 7, 2018 17:19
To: Michael Coleman
Cc: User support mailing list for Open OnDemand
Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails

I'm also scratching my head because the documentation (even for the 18 release) still says the default for --export is to propagate all environment variables.  :/



On Dec 7, 2018, at 6:59 PM, John-Paul Robinson <jprorama at gmail.com> wrote:



It's a pretty alarming change to have happened in three ticks of the minor release number.



Any insights from others on this?



On Dec 7, 2018, at 6:37 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:



Hi John-Paul,



As you say, I believe the key event was the transition in Slurm versions.  They apparently changed the behavior of exporting environment variables from the submitting environment to the job environment (the stuff --export controls).  There was much wailing from our users here when their sbatch scripts broke as a result.  Generally, the "fix" was simply for users who were already using the --export flag to add the "ALL" keyword to that list, which seemed to restore the old behavior.



Ultimately, OOD is calling 'sbatch' to create jobs, and this change affects the environment those jobs see.  At least in our environment, the --export=ALL flag seems to cure the OOD issues.  There are probably other ways to change things, but this seemed the simplest.
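
In other words, the user-side fix looked like this (job.sh and the variable names are placeholders):

    # Before the Slurm fix this effectively behaved like ALL,MYVAR1,MYVAR2:
    sbatch --export=MYVAR1,MYVAR2 job.sh
    # After the fix it exports only MYVAR1,MYVAR2; adding ALL explicitly
    # restores the old behavior:
    sbatch --export=ALL,MYVAR1,MYVAR2 job.sh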



Good luck,

Mike





-----Original Message-----
From: John-Paul Robinson <jprorama at gmail.com>
Sent: Friday, December 7, 2018 03:56 PM
To: Michael Coleman <mcolema5 at uoregon.edu>; User support mailing list for Open OnDemand <ood-users at lists.osc.edu>
Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails



Mike,



Thanks for pointing us to this issue.

This does appear to be similar to what's happening in our dev environment.  (Note our still-working prod environment is Bright CM with Slurm 17.02.2.)

The odd thing with our dev environment (built on OpenHPC) is that it was working in October and only started failing in builds over the past month.  This appears to coincide with the OpenHPC 1.3.5 to 1.3.6 update (going from Slurm 17.11.7 to 17.11.10).



We've had some success in restoring the original working configuration in one of our test stacks by reverting to the OpenHPC 1.3.5 release.



What's odd is this implies the problem is not with OOD but in the OpenHPC system env.  As far as we can determine, our OOD remains identical.  We are setting up dev in Vagrant with Ansible provisioning the OpenHPC + OOD cluster, based on the CRI_XSEDE work extended to add an OOD node via Vagrant + Ansible (not as a Warewulf provision).

https://github.com/jprorama/CRI_XCBC



I've read through the GitHub issue below but haven't teased out all the details.

Is there an obvious transition point where this export behavior could be impacted by the underlying system versions OOD is running on?



We'll contribute insights on the GitHub issue as we find them.

Thanks,

John-Paul





On 12/5/18 5:15 PM, Michael Coleman wrote:

Hi John-Paul,

We worked through something similar.  You might find some useful hints on this ticket.

  https://github.com/OSC/ood_core/issues/109

Cheers,

Mike









-----Original Message-----
From: OOD-users <ood-users-bounces+mcolema5=uoregon.edu at lists.osc.edu> On Behalf Of John-Paul Robinson via OOD-users
Sent: Wednesday, December 5, 2018 02:49 PM
To: ood-users at lists.osc.edu
Subject: [OOD-users] no environment set for HPC desktop -- job fails



In our dev environment (Slurm with OpenHPC) we have started to see this error when trying to launch interactive desktops:

/tmp/slurmd/job00079/slurm_script: line 3: module: command not found
Setting VNC password...
Error: no HOME environment variable
Starting VNC server...
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is



As we understand it, the PUN nginx worker launches the batch job that starts the desktop.

The problem seems to be that the environment for the job is empty, hence no module function or HOME env or anything else.  We checked the env of the user's nginx worker under /proc and it is completely empty.  Because our job env is inherited from the caller (the nginx worker in this case), the attempts to run the module and vncserver commands naturally fail.
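
For anyone wanting to reproduce that check, something along these lines works (the pgrep pattern is a guess; match it to however your PUN workers appear in ps):

    # Dump the environment of a running per-user nginx (PUN) worker.
    # /proc/<pid>/environ is NUL-separated, so translate to newlines.
    pid=$(pgrep -u "$USER" -f nginx | head -n 1)
    tr '\0' '\n' < "/proc/${pid}/environ"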



When we launch an interactive terminal, it runs just fine, but I'm guessing that's because the interactive session actually reads the normal shell startup and builds its environment, even if it happened to be missing in the proxy.



Do you have any pointers on what could cause this situation?  We noticed it after we started adding additional interactive apps but don't have a clear time point.  It was working fine originally and still functions fine in our prod env (without any of the additional interactive apps).

Thanks,

John-Paul






_______________________________________________
OOD-users mailing list
OOD-users at lists.osc.edu
https://lists.osu.edu/mailman/listinfo/ood-users

