[OOD-users] no environment set for HPC desktop -- job fails

John-Paul Robinson jprorama at gmail.com
Fri Dec 7 20:19:21 EST 2018


I’m also scratching my head because the documentation (even for the 18 release) still says the default for --export is to propagate all environment variables.  :/

> On Dec 7, 2018, at 6:59 PM, John-Paul Robinson <jprorama at gmail.com> wrote:
> 
> It's a pretty alarming change to have happened in three ticks of the minor release number. 
> 
> Any insights from others on this?
> 
>> On Dec 7, 2018, at 6:37 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
>> 
>> Hi John-Paul,
>> 
>> As you say, I believe the key event was the transition between SLURM versions.  They apparently changed the behavior of exporting environment variables from the submitting environment to the job environment (the stuff --export controls).  There was much wailing from our users here when their sbatch scripts broke as a result.  Generally, the "fix" was simply for users who were already using the --export flag to add the "ALL" keyword to that list, which seemed to restore the old behavior.
>> 
>> Ultimately, OOD is calling 'sbatch' to create jobs, and this change affects the environment those jobs see.  At least in our environment, the --export=ALL flag seems to cure OOD issues.  There are probably other ways to change things, but this seemed the simplest.
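>> (A quick local sketch of the difference, with env -i standing in for
>> Slurm's no-propagation behavior -- illustrative only, not Slurm itself:)

```shell
# Illustrative only: env -i launches a child with an empty environment,
# mimicking a job submitted without the old export-everything default;
# the plain invocation mimics what --export=ALL restores.
env -i bash -c 'echo "no export:  HOME=[$HOME]"'
bash -c 'echo "export=ALL: HOME=[$HOME]"'
```

>> (The first line prints an empty HOME, which is exactly the symptom the
>> failing desktop jobs show.)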
>> 
>> Good luck,
>> Mike
>> 
>> 
>> -----Original Message-----
>> From: John-Paul Robinson <jprorama at gmail.com> 
>> Sent: Friday, December 7, 2018 03:56 PM
>> To: Michael Coleman <mcolema5 at uoregon.edu>; User support mailing list for Open OnDemand <ood-users at lists.osc.edu>
>> Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails
>> 
>> Mike,
>> 
>> Thanks for pointing us to this issue.
>> 
>> This does appear to be similar to what's happening in our dev 
>> environment.  (Note our still-working prod environment is Bright CM with 
>> slurm 17.02.2).
>> 
>> The odd thing with our dev environment (built on OpenHPC) is that it was 
>> working in October and only started failing in builds over the past 
>> month.  This appears to coincide with the OpenHPC 1.3.5 to 1.3.6 update 
>> (going from slurm 17.11.7 to 17.11.10).
>> 
>> We've had some success in restoring the original working configuration 
>> in one of our test stacks by reverting to the OpenHPC 1.3.5 release.
>> 
>> What's odd is this implies the problem is not with OOD but in the 
>> OpenHPC system env.  As far as we can determine, our OOD remains 
>> identical.  We are setting up dev in Vagrant, with Ansible provisioning 
>> the OpenHPC + OOD cluster based on the CRI_XSEDE work, extended to add an 
>> OOD node via Vagrant + Ansible (not as a Warewulf provision).
>> 
>> https://github.com/jprorama/CRI_XCBC
>> 
>> I've read through the github issue below but haven't teased out all the 
>> details.
>> 
>> Is there an obvious transition point where this export behavior could be 
>> impacted by the underlying system versions OOD is running on?
>> 
>> We'll contribute insights on the github issue as we find them.
>> 
>> Thanks,
>> 
>> John-Paul
>> 
>> 
>>> On 12/5/18 5:15 PM, Michael Coleman wrote:
>>> Hi John-Paul,
>>> 
>>> We worked through something similar.  You might find some useful hints on this ticket.
>>> 
>>>    https://github.com/OSC/ood_core/issues/109
>>> 
>>> Cheers,
>>> Mike
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: OOD-users <ood-users-bounces+mcolema5=uoregon.edu at lists.osc.edu> On Behalf Of John-Paul Robinson via OOD-users
>>> Sent: Wednesday, December 5, 2018 02:49 PM
>>> To: ood-users at lists.osc.edu
>>> Subject: [OOD-users] no environment set for HPC desktop -- job fails
>>> 
>>> In our dev environment (slurm with ohpc) we have started to see this
>>> error when trying to launch interactive desktops:
>>> 
>>> /tmp/slurmd/job00079/slurm_script: line 3: module: command not found
>>> Setting VNC password...
>>> Error: no HOME environment variable
>>> Starting VNC server...
>>> vncserver: The HOME environment variable is not set.
>>> vncserver: The HOME environment variable is not set.
>>> vncserver: The HOME environment variable is not set.
>>> vncserver: The HOME environment variable is
>>> 
>>> 
>>> As we understand it, the per-user NGINX (PUN) worker launches the batch
>>> job that starts the interactive desktop.
>>> 
>>> The problem seems to be that the environment for the job is empty, hence
>>> no module function, no HOME env, nor anything else.  We checked the env of
>>> the user's nginx worker under /proc and it is completely empty.  Because
>>> our job env is inherited from the caller (the nginx worker in this case),
>>> the attempts to run the module and vncserver commands naturally fail.
>>> 
>>> When we launch an interactive terminal, it runs just fine, but I'm
>>> guessing that's because the interactive session actually reads the
>>> normal shell startup and builds its environment, even if it happened to
>>> be missing in the proxy.
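>>> (That guess is easy to check locally: a login shell re-sources
>>> /etc/profile and /etc/profile.d/*.sh -- where the module function and
>>> much of PATH usually come from -- even when started with no
>>> environment at all.  A sketch:)

```shell
# Even with a completely empty starting environment, a login shell (-l)
# re-reads /etc/profile and /etc/profile.d/*.sh, rebuilding PATH (and,
# on module-based systems, the module shell function).
env -i bash -lc 'echo "PATH=$PATH"'
```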
>>> 
>>> Do you have any pointers on what could cause this situation?  We
>>> noticed it after we started adding additional interactive apps, but we
>>> don't have a clear time point.  It was working fine originally and still
>>> functions fine in our prod env (without any of the additional
>>> interactive apps).
>>> 
>>> Thanks,
>>> 
>>> John-Paul
>>> 
>>> _______________________________________________
>>> OOD-users mailing list
>>> OOD-users at lists.osc.edu
>>> https://lists.osu.edu/mailman/listinfo/ood-users
>> 


More information about the OOD-users mailing list