[OOD-users] no environment set for HPC desktop -- job fails

John-Paul Robinson jprorama at gmail.com
Sat Dec 8 13:54:25 EST 2018


I gotta say, I am trying to read it very very carefully but don’t seem to be able to parse out the subtlety. 😐

Not sure how to misread this (the emphasis on "all" is theirs):


--export=<environment variables [ALL] | NONE>
Identify which environment variables from the submission environment are propagated to the launched application. By default, all are propagated. Multiple environment variable names should be comma separated. 
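
A minimal Python sketch of the workaround described in the quoted message below (adding the ALL keyword to an existing --export list before calling sbatch). The helper names add_all_keyword and sbatch_command and the script name job.sh are hypothetical, and the sketch assumes no exported value itself contains a comma. It only builds the argument list; it does not model how any particular Slurm release interprets --export.

```python
def add_all_keyword(export_value: str) -> str:
    """Return an --export value that also carries the ALL keyword.

    Hypothetical helper: splits on commas, so it assumes that no
    exported value contains a comma of its own.
    """
    entries = [e.strip() for e in export_value.split(",") if e.strip()]
    if not entries:
        # Nothing requested explicitly: ask for the full environment.
        return "ALL"
    if "ALL" in entries or entries == ["NONE"]:
        # Already propagates everything, or the user explicitly asked
        # for nothing; leave both cases untouched (an assumption).
        return ",".join(entries)
    return ",".join(["ALL"] + entries)


def sbatch_command(script: str, export_value: str = "") -> list:
    """Build an sbatch argument list with ALL added to --export."""
    return ["sbatch", f"--export={add_all_keyword(export_value)}", script]


if __name__ == "__main__":
    # job.sh is a placeholder script name.
    print(" ".join(sbatch_command("job.sh", "FOO=1,BAR")))
```

Run as a script, this prints `sbatch --export=ALL,FOO=1,BAR job.sh`.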



> On Dec 7, 2018, at 10:24 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
> 
> If you read the man page very, very carefully, I think the new behavior actually matches what it always said.  The old behavior didn't actually match.  But I definitely agree that it's a pretty significant change.
> 
> Michael Coleman (mcolema5 at uoregon.edu), Computational Scientist
> Research Advanced Computing Services
> 6235 University of Oregon
> Eugene, OR 97403
> 
> 
> ________________________________________
> From: John-Paul Robinson <jprorama at gmail.com>
> Sent: Friday, December 7, 2018 17:19
> To: Michael Coleman
> Cc: User support mailing list for Open OnDemand
> Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails
> 
> I’m also scratching my head because documentation (even for the 18 release) still says the default of export is to propagate all environment variables.  :/
> 
>> On Dec 7, 2018, at 6:59 PM, John-Paul Robinson <jprorama at gmail.com> wrote:
>> 
>> It's a pretty alarming change to have happened in three ticks of the minor release number.
>> 
>> Any insights from others on this?
>> 
>>> On Dec 7, 2018, at 6:37 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
>>> 
>>> Hi John-Paul,
>>> 
>>> As you say, I believe the key event was the transition in SLURM versions.  They apparently made a change in the behavior of export of environment variables from the submitting environment to the job environment (the stuff --export controls).  There was much wailing of our users here when their sbatch scripts broke as a result.  Generally, the "fix" was simply for users that were already using the --export flag to add the "ALL" keyword to that list, which seemed to restore the old behavior.
>>> 
>>> Ultimately, OOD is calling 'sbatch' to create jobs, and this change affects the environment those jobs see.  At least in our environment, the --export=ALL flag seems to cure OOD issues.  There are probably other ways to change things, but this seemed the simplest.
>>> 
>>> Good luck,
>>> Mike
>>> 
>>> 
>>> -----Original Message-----
>>> From: John-Paul Robinson <jprorama at gmail.com>
>>> Sent: Friday, December 7, 2018 03:56 PM
>>> To: Michael Coleman <mcolema5 at uoregon.edu>; User support mailing list for Open OnDemand <ood-users at lists.osc.edu>
>>> Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails
>>> 
>>> Mike,
>>> 
>>> Thanks for pointing us to this issue.
>>> 
>>> This does appear to be similar to what's happening in our dev
>>> environment.  (Note our still-working prod environment is Bright CM with
>>> slurm 17.02.2).
>>> 
>>> The odd thing with our dev environment (built on OpenHPC) is that it was
>>> working in October and only started failing in builds over the past
>>> month.  This appears to coincide with the OpenHPC 1.3.5 to 1.3.6 update
>>> (going from slurm 17.11.7 to 17.11.10).
>>> 
>>> We've had some success in restoring the original working configuration
>>> in one of our test stacks by reverting to the OpenHPC 1.3.5 release.
>>> 
>>> What's odd is this implies the problem is not with OOD but in the
>>> OpenHPC system env.  As far as we can determine, our OOD remains
>>> identical.  We are setting up dev in vagrant with ansible provisioning
>>> the openhpc + ood cluster based on the CRI_XSEDE work extended to add an
>>> OOD node via vagrant + ansible (not as a warewulf provision).
>>> 
>>> https://github.com/jprorama/CRI_XCBC
>>> 
>>> I've read through the github issue below but haven't teased out all the
>>> details.
>>> 
>>> Is there an obvious transition point where this export behavior could be
>>> impacted by the underlying system versions OOD is running on?
>>> 
>>> We'll contribute insights on the github issue as we find them.
>>> 
>>> Thanks,
>>> 
>>> John-Paul
>>> 
>>> 
>>>> On 12/5/18 5:15 PM, Michael Coleman wrote:
>>>> Hi John-Paul,
>>>> 
>>>> We worked through something similar.  You might find some useful hints on this ticket.
>>>> 
>>>>   https://github.com/OSC/ood_core/issues/109
>>>> 
>>>> Cheers,
>>>> Mike
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: OOD-users <ood-users-bounces+mcolema5=uoregon.edu at lists.osc.edu> On Behalf Of John-Paul Robinson via OOD-users
>>>> Sent: Wednesday, December 5, 2018 02:49 PM
>>>> To: ood-users at lists.osc.edu
>>>> Subject: [OOD-users] no environment set for HPC desktop -- job fails
>>>> 
>>>> In our dev environment (slurm with ohpc) we have started to see this
>>>> error when trying to launch interactive desktops:
>>>> 
>>>> /tmp/slurmd/job00079/slurm_script: line 3: module: command not found
>>>> Setting VNC password...
>>>> Error: no HOME environment variable
>>>> Starting VNC server...
>>>> vncserver: The HOME environment variable is not set.
>>>> vncserver: The HOME environment variable is not set.
>>>> vncserver: The HOME environment variable is not set. vncserver: The HOME
>>>> environment variable is
>>>> 
>>>> 
>>>> As we understand it, the PUN nginx worker launches the batch job that
>>>> starts the desktop batch job.
>>>> 
>>>> The problem seems to be that the environment for the job is empty, hence
>>>> no module function or HOME env or anything else.   We checked the env of
>>>> the user's nginx worker under /proc and it is completely empty.   Because
>>>> our job env is inherited from the caller (the nginx worker in this case)
>>>> the attempts to run the module and vncserver commands naturally fail.
>>>> 
>>>> When we launch an interactive terminal, it runs just fine, but I'm
>>>> guessing that's because the interactive session actually reads the
>>>> normal shell startup and builds its environment, even if it happened to
>>>> be missing in the proxy.
>>>> 
>>>> Do you have any pointers on what could cause this situation?   We
>>>> noticed it after we started adding additional interactive apps but don't
>>>> have a clear time point.  It was working fine originally and still
>>>> functions fine in our prod env (without any of the additional
>>>> interactive apps).
>>>> 
>>>> Thanks,
>>>> 
>>>> John-Paul
>>>> 
>>>> _______________________________________________
>>>> OOD-users mailing list
>>>> OOD-users at lists.osc.edu
>>>> https://lists.osu.edu/mailman/listinfo/ood-users
>>> 
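
The /proc check mentioned in the original report (inspecting the environment of the user's nginx worker) can be sketched in Python as follows. This assumes a Linux /proc filesystem, where /proc/<pid>/environ holds the NUL-separated NAME=value entries the process was started with. The function names are hypothetical, and the required-variable list (HOME, PATH) is only illustrative. Reading another user's process normally requires the same UID or root.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated NAME=value format of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty piece after the trailing NUL
        name, sep, value = entry.partition(b"=")
        if sep:  # ignore malformed entries that have no '='
            env[name.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_process_env(pid) -> dict:
    """Read the start-up environment of a process (Linux only)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())


def missing_variables(env: dict, required=("HOME", "PATH")) -> list:
    """List the required variable names that are absent from env."""
    return [name for name in required if name not in env]


if __name__ == "__main__":
    try:
        # "self" inspects this Python process; substitute a worker PID.
        env = read_process_env("self")
    except OSError as exc:
        print(f"could not read environment: {exc}")
    else:
        print(f"{len(env)} variables, missing: {missing_variables(env)}")
```

An empty dictionary from read_process_env would match the "completely empty" environment described in the report.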