[OOD-users] no environment set for HPC desktop -- job fails

Jeremy Nicklas jeremywnicklas at gmail.com
Sat Dec 8 18:50:58 EST 2018


Hi all,

Original developer of the Slurm adapter in Open OnDemand here. I want to first
clear up any confusion about environment variables. Behind the scenes, when
OnDemand calls the `sbatch` command to submit a job, it actually:

- sets the environment variable `SBATCH_EXPORT=NONE`
- adds the command line argument `--parsable` for easier parsing of the job
id

The relevant snippet of code can be seen here:

https://github.com/OSC/ood_core/blob/866db98e3b3ba574a5799a3af84d2b5ffc010ff8/lib/ood_core/job/adapters/slurm.rb#L128-L138
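
In shell terms, that submission is roughly equivalent to the following (a
simplified sketch; the real call is made from Ruby in the linked slurm.rb, and
`job_script.sh` plus any other sbatch options are just placeholders here):

    # SBATCH_EXPORT is set only for this one sbatch invocation
    SBATCH_EXPORT=NONE sbatch --parsable job_script.sh
    # --parsable prints just "<jobid>[;<cluster>]", which is trivial to capture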

I purposely set `SBATCH_EXPORT=NONE` for a few reasons:

- as these are Passenger apps running behind a `sudo` user switch, the
environment is already drastically different from the user's login shell
environment when they call `sbatch`
- the library path (`LD_LIBRARY_PATH`) and binary path (`PATH`) could be
different between the web node and the cluster node
- if you use `SBATCH_EXPORT=NONE`, the Slurm documentation states:

  > Slurm will then implicitly attempt to load the user's environment on
the node where the script is being executed

  which is what we would want.

I do not recommend setting this environment variable to `ALL` for the
reasons listed above. It may cause more headaches down the road.
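
If you want to sanity check the implicit environment loading on your own
cluster, a quick test along these lines (hypothetical; add whatever
partition/account options your site requires) should show `HOME`, `PATH`,
etc. populated from the user's login environment even though nothing was
exported from the submitting shell:

    SBATCH_EXPORT=NONE sbatch --parsable --wrap 'echo "HOME=$HOME"; echo "PATH=$PATH"'
    # the output lands in the usual slurm-<jobid>.out file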

Now I am not entirely sure why you are seeing this issue after updating
Slurm, but taking a look at the NEWS file for Slurm, one particular bugfix
stands out:

https://github.com/SchedMD/slurm/blob/d128eb21583fe80b808ce64667cd89d209cd62ab/NEWS#L532

That particular commit in 17.11.11 fixes an issue in `src/common/env.c`
which was introduced in 17.11.10. The exact commit is here:

https://github.com/SchedMD/slurm/commit/72b2355ca8d6f4381b4e417f76649712881f45b7

I am not sure if this is what you are experiencing, but could you test
OnDemand using Slurm 17.11.11?

- Jeremy Nicklas

On Sat, Dec 8, 2018 at 5:33 PM John-Paul Robinson via OOD-users <
ood-users at lists.osc.edu> wrote:

> Ok.
>
> My experience has been the behavior with the second export below and not
> the third. That is, specify an env var list and you only get those specific
> vars; don't specify any and you get the ALL behavior.
>
> What's odd is that desktop jobs did work before and now don't, suggesting
> Slurm has changed with respect to the OOD assumptions.
>
> That leaves the question: where are those OOD assumptions made, and can we
> just add ALL to the front of the list?
>
> Is that the conclusion of the github issue?
>
> On Dec 8, 2018, at 1:54 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
>
> The key change is what happens if you use a flag like
>
>
>     --export=MYVAR1,MYVAR2
>
>
> The behavior mentioned in the man page (see
> https://github.com/OSC/ood_core/issues/109) says that no other
> environment variables will be passed.  In other words, this is equivalent to
>
>
>     --export=NONE,MYVAR1,MYVAR2
>
>
> The recent fix actually makes sbatch behave this way.  Previously, the
> implementation of this case was incorrectly equivalent to
>
>
>     --export=ALL,MYVAR1,MYVAR2
>
>
> And just to make us all completely insane, it looks like they're changing
> this further in a newer release (that we don't yet use at our site).  See
> the new man page here:
>
>
>     https://slurm.schedmd.com/sbatch.html
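>
> A quick way to see which behavior a given sbatch build has (a hypothetical
> check; partition/account options omitted) is to set a variable only in the
> submitting shell and see whether it reaches the job:
>
>     export SOMEVAR=hello
>     sbatch --export=MYVAR1=1 --wrap 'echo "SOMEVAR=${SOMEVAR:-unset}"'
>     # old behavior (implicit ALL):    job output shows SOMEVAR=hello
>     # fixed behavior (implicit NONE): job output shows SOMEVAR=unset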
>
>
> Mike
>
> Michael Coleman (mcolema5 at uoregon.edu), Computational Scientist
>
> Research Advanced Computing Services
> 6235 University of Oregon
> Eugene, OR 97403
>
>
> ------------------------------
> *From:* John-Paul Robinson <jprorama at gmail.com>
> *Sent:* Saturday, December 8, 2018 10:54
> *To:* Michael Coleman
> *Cc:* User support mailing list for Open OnDemand
> *Subject:* Re: [OOD-users] no environment set for HPC desktop -- job fails
>
> I gotta say, I am trying to read it very very carefully but don’t seem to
> be able to parse out the subtlety. 😐
>
> Not sure how to misread this; the emphasis on ALL is theirs:
>
>
> *--export*=<environment variables [ALL] | NONE>
> Identify which environment variables from the submission environment are
> propagated to the launched application. By default, all are propagated.
> Multiple environment variable names should be comma separated.
>
>
>
> On Dec 7, 2018, at 10:24 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
>
> If you read the man page very, very carefully, I think the new behavior
> actually matches what it always said.  The old behavior didn't actually
> match.  But I definitely agree that it's a pretty significant change.
>
> Michael Coleman (mcolema5 at uoregon.edu), Computational Scientist
> Research Advanced Computing Services
> 6235 University of Oregon
> Eugene, OR 97403
>
>
> ________________________________________
> From: John-Paul Robinson <jprorama at gmail.com>
> Sent: Friday, December 7, 2018 17:19
> To: Michael Coleman
> Cc: User support mailing list for Open OnDemand
> Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails
>
> I'm also scratching my head because the documentation (even for the 18
> release) still says the default for --export is to propagate all environment
> variables.  :/
>
> On Dec 7, 2018, at 6:59 PM, John-Paul Robinson <jprorama at gmail.com> wrote:
>
>
> It's a pretty alarming change to have happened in three ticks of the minor
> release number.
>
>
> Any insights from others on this?
>
>
> On Dec 7, 2018, at 6:37 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
>
>
> Hi John-Paul,
>
>
> As you say, I believe the key event was the transition in SLURM versions.
> They apparently made a change in the behavior of export of environment
> variables from the submitting environment to the job environment (the stuff
> --export controls).  There was much wailing from our users here when their
> sbatch scripts broke as a result.  Generally, the "fix" was simply for
> users who were already using the --export flag to add the "ALL" keyword to
> that list, which seemed to restore the old behavior.
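>
> For example, a hypothetical before/after (the script name and variables are
> just illustrative):
>
>     # relied on the old, pre-fix behavior of implicitly exporting everything
>     sbatch --export=MYVAR1,MYVAR2 job.sh
>
>     # explicit fix after the change: keep the full environment as well
>     sbatch --export=ALL,MYVAR1,MYVAR2 job.sh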
>
>
> Ultimately, OOD is calling 'sbatch' to create jobs, and this change
> affects the environment those jobs see.  At least in our environment, the
> --export=ALL flag seems to cure OOD issues.  There are probably other ways
> to change things, but this seemed the simplest.
>
>
> Good luck,
>
> Mike
>
>
>
> -----Original Message-----
>
> From: John-Paul Robinson <jprorama at gmail.com>
>
> Sent: Friday, December 7, 2018 03:56 PM
>
> To: Michael Coleman <mcolema5 at uoregon.edu>; User support mailing list for
> Open OnDemand <ood-users at lists.osc.edu>
>
> Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails
>
>
> Mike,
>
>
> Thanks for pointing us to this issue.
>
>
> This does appear to be similar to what's happening in our dev
> environment.  (Note our still-working prod environment is Bright CM with
> slurm 17.02.2).
>
> The odd thing with our dev environment (built on OpenHPC) is that it was
> working in October and only started failing in builds over the past
> month.  This appears to coincide with the OpenHPC 1.3.5 to 1.3.6 update
> (going from slurm 17.11.7 to 17.11.10).
>
> We've had some success in restoring the original working configuration
> in one of our test stacks by reverting to the OpenHPC 1.3.5 release.
>
> What's odd is this implies the problem is not with OOD but in the
> OpenHPC system env.  As far as we can determine, our OOD remains
> identical.  We are setting up dev in vagrant with ansible provisioning
> the openhpc + ood cluster based on the CRI_XSEDE work extended to add an
> OOD node via vagrant + ansible (not as a warewulf provision).
>
> https://github.com/jprorama/CRI_XCBC
>
> I've read through the github issue below but haven't teased out all the
> details.
>
> Is there an obvious transition point where this export behavior could be
> impacted by the underlying system versions OOD is running on?
>
> We'll contribute insights on the github issue as we find them.
>
>
> Thanks,
>
>
> John-Paul
>
>
>
> On 12/5/18 5:15 PM, Michael Coleman wrote:
>
> Hi John-Paul,
>
>
> We worked through something similar.  You might find some useful hints on
> this ticket.
>
>
>   https://github.com/OSC/ood_core/issues/109
>
>
> Cheers,
>
> Mike
>
>
>
>
>
> -----Original Message-----
>
> From: OOD-users <ood-users-bounces+mcolema5=uoregon.edu at lists.osc.edu> On
> Behalf Of John-Paul Robinson via OOD-users
>
> Sent: Wednesday, December 5, 2018 02:49 PM
>
> To: ood-users at lists.osc.edu
>
> Subject: [OOD-users] no environment set for HPC desktop -- job fails
>
>
> In our dev environment (slurm with ohpc) we have started to see this
> error when trying to launch interactive desktops:
>
> /tmp/slurmd/job00079/slurm_script: line 3: module: command not found
> Setting VNC password...
> Error: no HOME environment variable
> Starting VNC server...
> vncserver: The HOME environment variable is not set.
> vncserver: The HOME environment variable is not set.
> vncserver: The HOME environment variable is not set. vncserver: The HOME
> environment variable is
>
>
>
> As we understand it, the PUN nginx worker launches the batch job that
> starts the desktop.
>
> The problem seems to be that the environment for the job is empty, hence
> no module function or HOME env or anything else.  We checked the env of
> the user's nginx worker under /proc and it is completely empty.  Because
> our job env is inherited from the caller (the nginx worker in this case),
> the attempt to run the module command and vncserver commands naturally
> fail.
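>
> (For anyone checking the same thing, a hypothetical way to dump a process's
> environment from /proc, given the PID of the per-user nginx worker:)
>
>     pgrep -u "$USER" nginx                 # list this user's nginx PIDs
>     tr '\0' '\n' < /proc/<pid>/environ     # substitute a PID from above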
>
>
> When we launch an interactive terminal, it runs just fine, but I'm
> guessing that's because the interactive session actually reads the
> normal shell startup and builds its environment, even if it happened to
> be missing in the proxy.
>
> Do you have any pointers on what could cause this situation?  We
> noticed it after we started adding additional interactive apps but don't
> have a clear time point.  It was working fine originally and still
> functions fine in our prod env (without any of the additional
> interactive apps).
>
>
> Thanks,
>
>
> John-Paul
>
>
> _______________________________________________
> OOD-users mailing list
> OOD-users at lists.osc.edu
> https://lists.osu.edu/mailman/listinfo/ood-users
>