[OOD-users] no environment set for HPC desktop -- job fails

John-Paul Robinson jprorama at gmail.com
Fri Dec 7 19:59:20 EST 2018


It's a pretty alarming change to have happened in three ticks of the minor release number.

Any insights from others on this?

> On Dec 7, 2018, at 6:37 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
> 
> Hi John-Paul,
> 
> As you say, I believe the key event was the transition in SLURM versions.  They apparently changed how environment variables are exported from the submitting environment to the job environment (the stuff --export controls).  There was much wailing from our users here when their sbatch scripts broke as a result.  Generally, the "fix" was simply for users who were already using the --export flag to add the "ALL" keyword to that list, which seemed to restore the old behavior.
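> 
> For example, a script that previously exported just one variable could be updated like this (a hypothetical job script; the only change is prepending ALL to the --export list):
> 
>     #!/bin/bash
>     #SBATCH --job-name=test
>     # was: --export=MYVAR; under 17.11 that exports only MYVAR (plus SLURM_*)
>     #SBATCH --export=ALL,MYVAR
> 
>     module load gcc      # works again once the full environment is propagated
>     echo "HOME=$HOME"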
> 
> Ultimately, OOD is calling 'sbatch' to create jobs, and this change affects the environment those jobs see.  At least in our environment, the --export=ALL flag seems to cure OOD issues.  There are probably other ways to change things, but this seemed the simplest.
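> 
> A quick way to confirm which behavior your sbatch has (a sketch; env.sh is a hypothetical one-line script that just runs `env | sort`):
> 
>     FOO=bar sbatch --export=FOO env.sh      # 17.11+: job sees only FOO plus SLURM_* variables
>     FOO=bar sbatch --export=ALL,FOO env.sh  # job sees FOO plus the full submitting environment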
> 
> Good luck,
> Mike
> 
> 
> -----Original Message-----
> From: John-Paul Robinson <jprorama at gmail.com> 
> Sent: Friday, December 7, 2018 03:56 PM
> To: Michael Coleman <mcolema5 at uoregon.edu>; User support mailing list for Open OnDemand <ood-users at lists.osc.edu>
> Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails
> 
> Mike,
> 
> Thanks for pointing us to this issue.
> 
> This does appear to be similar to what's happening in our dev 
> environment.  (Note our still-working prod environment is Bright CM with 
> slurm 17.02.2).
> 
> The odd thing with our dev environment (built on OpenHPC) is that it was 
> working in October and only started failing in builds over the past 
> month.  This appears to coincide with the OpenHPC 1.3.5 to 1.3.6 update 
> (going from slurm 17.11.7 to 17.11.10).
> 
> We've had some success in restoring the original working configuration 
> in one of our test stacks by reverting to the OpenHPC 1.3.5 release.
> 
> What's odd is this implies the problem is not with OOD but in the 
> OpenHPC system env.  As far as we can determine, our OOD install remains 
> identical.  We set up dev with Vagrant and Ansible provisioning the 
> OpenHPC + OOD cluster, based on the CRI_XSEDE work extended to add an 
> OOD node directly via Vagrant + Ansible (not as a Warewulf provision).
> 
> https://github.com/jprorama/CRI_XCBC
> 
> I've read through the github issue below but haven't teased out all the 
> details.
> 
> Is there an obvious transition point where this export behavior could be 
> impacted by the underlying system versions OOD is running on?
> 
> We'll contribute insights on the github issue as we find them.
> 
> Thanks,
> 
> John-Paul
> 
> 
>> On 12/5/18 5:15 PM, Michael Coleman wrote:
>> Hi John-Paul,
>> 
>> We worked through something similar.  You might find some useful hints on this ticket.
>> 
>>     https://github.com/OSC/ood_core/issues/109
>> 
>> Cheers,
>> Mike
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: OOD-users <ood-users-bounces+mcolema5=uoregon.edu at lists.osc.edu> On Behalf Of John-Paul Robinson via OOD-users
>> Sent: Wednesday, December 5, 2018 02:49 PM
>> To: ood-users at lists.osc.edu
>> Subject: [OOD-users] no environment set for HPC desktop -- job fails
>> 
>> In our dev environment (slurm with ohpc) we have started to see this
>> error when trying to launch interactive desktops:
>> 
>> /tmp/slurmd/job00079/slurm_script: line 3: module: command not found
>> Setting VNC password...
>> Error: no HOME environment variable
>> Starting VNC server...
>> vncserver: The HOME environment variable is not set.
>> vncserver: The HOME environment variable is not set.
>> vncserver: The HOME environment variable is not set.
>> vncserver: The HOME environment variable is not set.
>> 
>> 
>> As we understand it, the PUN nginx worker submits the batch job that
>> starts the interactive desktop.
>> 
>> The problem seems to be that the environment for the job is empty, hence
>> no module function, no HOME env, nothing else.  We checked the env of the
>> user's nginx worker under /proc and it is completely empty.  Because our
>> job env is inherited from the caller (the nginx worker in this case),
>> the attempts to run the module and vncserver commands naturally fail.
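>> 
>> For reference, this is how we inspected the worker's environment (a
>> sketch; the pgrep pattern assumes the usual "nginx: worker process"
>> title, and environ is NUL-delimited, hence the tr):
>> 
>>     pid=$(pgrep -u "$USER" -f 'nginx: worker' | head -1)
>>     tr '\0' '\n' < /proc/$pid/environ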
>> 
>> When we launch an interactive terminal, it runs just fine, but I'm
>> guessing that's because the interactive session actually reads the
>> normal shell startup files and builds its own environment, even though
>> the environment was missing in the proxy.
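>> 
>> (If it really is just environment inheritance, forcing the job script to
>> behave like a login shell might paper over it -- a sketch, not a tested
>> fix:)
>> 
>>     #!/bin/bash -l
>>     # -l makes bash source /etc/profile and /etc/profile.d/*, which is
>>     # where the module/Lmod shell function is normally defined
>>     export HOME=${HOME:-$(getent passwd "$(id -un)" | cut -d: -f6)}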
>> 
>> Do you have any pointers on what could cause this situation?  We
>> noticed it after we started adding additional interactive apps but don't
>> have a clear time point.  It was working fine originally and still
>> functions fine in our prod env (without any of the additional
>> interactive apps).
>> 
>> Thanks,
>> 
>> John-Paul
>> 
>> _______________________________________________
>> OOD-users mailing list
>> OOD-users at lists.osc.edu
>> https://lists.osu.edu/mailman/listinfo/ood-users
> 

