[OOD-users] no environment set for HPC desktop -- job fails
John-Paul Robinson
jprorama at gmail.com
Sat Dec 8 13:54:25 EST 2018
I gotta say, I am trying to read it very very carefully but don’t seem to be able to parse out the subtlety. 😐
Not sure how to misread this (the emphasis on ALL is theirs):
--export=<environment variables [ALL] | NONE>
Identify which environment variables from the submission environment are propagated to the launched application. By default, all are propagated. Multiple environment variable names should be comma separated.
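
For reference, the workaround Mike describes below (adding the ALL keyword to an existing --export list) could be sketched as a small shell helper. The function name add_all and the variable names FOO and BAR are made up for illustration; this only shows the string handling and is not something Slurm or OOD provides.

```shell
#!/bin/sh
# add_all is a hypothetical helper name, not part of Slurm or OOD.
# It takes the comma-separated list a script already passes to --export
# and prints the same flag with the ALL keyword added if it is missing.
add_all() {
    list=$1
    case ",$list," in
        *,ALL,*) printf '%s\n' "--export=$list" ;;      # ALL already present
        ,,)      printf '%s\n' "--export=ALL" ;;        # empty list
        *)       printf '%s\n' "--export=ALL,$list" ;;  # prepend ALL
    esac
}

add_all "FOO,BAR"    # prints --export=ALL,FOO,BAR
add_all "ALL,FOO"    # prints --export=ALL,FOO
```

The resulting flag would then replace the original --export option on the sbatch command line, e.g. sbatch "$(add_all FOO,BAR)" job.sh, where job.sh is a placeholder script name.
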
> On Dec 7, 2018, at 10:24 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
>
> If you read the man page very, very carefully, I think the new behavior actually matches what it always said. The old behavior didn't actually match. But I definitely agree that it's a pretty significant change.
>
> Michael Coleman (mcolema5 at uoregon.edu), Computational Scientist
> Research Advanced Computing Services
> 6235 University of Oregon
> Eugene, OR 97403
>
>
> ________________________________________
> From: John-Paul Robinson <jprorama at gmail.com>
> Sent: Friday, December 7, 2018 17:19
> To: Michael Coleman
> Cc: User support mailing list for Open OnDemand
> Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails
>
> I’m also scratching my head because the documentation (even for the 18 release) still says the default for --export is to propagate all environment variables. :/
>
>> On Dec 7, 2018, at 6:59 PM, John-Paul Robinson <jprorama at gmail.com> wrote:
>>
>> It's a pretty alarming change to have happened in three ticks of the minor release number.
>>
>> Any insights from others on this?
>>
>>> On Dec 7, 2018, at 6:37 PM, Michael Coleman <mcolema5 at uoregon.edu> wrote:
>>>
>>> Hi John-Paul,
>>>
>>> As you say, I believe the key event was the transition in SLURM versions. They apparently changed how environment variables are exported from the submitting environment to the job environment (the stuff --export controls). There was much wailing from our users here when their sbatch scripts broke as a result. Generally, the "fix" was simply for users who were already using the --export flag to add the "ALL" keyword to that list, which seemed to restore the old behavior.
>>>
>>> Ultimately, OOD is calling 'sbatch' to create jobs, and this change affects the environment those jobs see. At least in our environment, the --export=ALL flag seems to cure OOD issues. There are probably other ways to change things, but this seemed the simplest.
>>>
>>> Good luck,
>>> Mike
>>>
>>>
>>> -----Original Message-----
>>> From: John-Paul Robinson <jprorama at gmail.com>
>>> Sent: Friday, December 7, 2018 03:56 PM
>>> To: Michael Coleman <mcolema5 at uoregon.edu>; User support mailing list for Open OnDemand <ood-users at lists.osc.edu>
>>> Subject: Re: [OOD-users] no environment set for HPC desktop -- job fails
>>>
>>> Mike,
>>>
>>> Thanks for pointing us to this issue.
>>>
>>> This does appear to be similar to what's happening in our dev
>>> environment. (Note our still-working prod environment is Bright CM with
>>> slurm 17.02.2).
>>>
>>> The odd thing with our dev environment (built on OpenHPC) is that it was
>>> working in October and only started failing in builds over the past
>>> month. This appears to coincide with the OpenHPC 1.3.5 to 1.3.6 update
>>> (going from slurm 17.11.7 to 17.11.10).
>>>
>>> We've had some success in restoring the original working configuration
>>> in one of our test stacks by reverting to the OpenHPC 1.3.5 release.
>>>
>>> What's odd is this implies the problem is not with OOD but in the
>>> OpenHPC system env. As far as we can determine, our OOD remains
>>> identical. We are setting up dev in vagrant with ansible provisioning
>>> the openhpc + ood cluster based on the CRI_XSEDE work extended to add an
>>> OOD node via vagrant + ansible (not as a warewulf provision).
>>>
>>> https://github.com/jprorama/CRI_XCBC
>>>
>>> I've read through the github issue below but haven't teased out all the
>>> details.
>>>
>>> Is there an obvious transition point where this export behavior could be
>>> impacted by the underlying system versions OOD is running on?
>>>
>>> We'll contribute insights on the github issue as we find them.
>>>
>>> Thanks,
>>>
>>> John-Paul
>>>
>>>
>>>> On 12/5/18 5:15 PM, Michael Coleman wrote:
>>>> Hi John-Paul,
>>>>
>>>> We worked through something similar. You might find some useful hints on this ticket.
>>>>
>>>> https://github.com/OSC/ood_core/issues/109
>>>>
>>>> Cheers,
>>>> Mike
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: OOD-users <ood-users-bounces+mcolema5=uoregon.edu at lists.osc.edu> On Behalf Of John-Paul Robinson via OOD-users
>>>> Sent: Wednesday, December 5, 2018 02:49 PM
>>>> To: ood-users at lists.osc.edu
>>>> Subject: [OOD-users] no environment set for HPC desktop -- job fails
>>>>
>>>> In our dev environment (slurm with ohpc) we have started to see this
>>>> error when trying to launch interactive desktops:
>>>>
>>>> /tmp/slurmd/job00079/slurm_script: line 3: module: command not found
>>>> Setting VNC password...
>>>> Error: no HOME environment variable
>>>> Starting VNC server...
>>>> vncserver: The HOME environment variable is not set.
>>>> vncserver: The HOME environment variable is not set.
>>>> vncserver: The HOME environment variable is not set. vncserver: The HOME
>>>> environment variable is
>>>>
>>>>
>>>> As we understand it, the PUN nginx worker launches the batch job that
>>>> starts the desktop batch job.
>>>>
>>>> The problem seems to be that the environment for the job is empty, hence
>>>> no module function or HOME env or anything else. We checked the env of
>>>> the user's nginx worker under /proc and it is completely empty. Because
>>>> our job env is inherited from the caller (the nginx worker in this case),
>>>> the attempts to run the module and vncserver commands naturally fail.
>>>>
>>>> When we launch an interactive terminal, it runs just fine, but I'm
>>>> guessing that's because the interactive session actually reads the
>>>> normal shell startup and builds its environment, even if it happened to
>>>> be missing in the proxy.
>>>>
>>>> Do you have any pointers on what could cause this situation? We
>>>> noticed it after we started adding additional interactive apps but don't
>>>> have a clear time point. It was working fine originally and still
>>>> functions fine in our prod env (without any of the additional
>>>> interactive apps).
>>>>
>>>> Thanks,
>>>>
>>>> John-Paul
>>>>
>>>
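
The /proc check mentioned in the original report quoted above can be reproduced roughly as follows on Linux. This is a minimal sketch: print_proc_env is a hypothetical helper name, and reading the environment of another user's process generally requires root or the same uid.

```shell
#!/bin/sh
# print_proc_env is a hypothetical helper name.
# /proc/<pid>/environ holds the NUL-separated NAME=value pairs a process
# was started with (Linux only), so translate NULs to newlines to read it.
print_proc_env() {
    tr '\0' '\n' < "/proc/$1/environ"
}

# Example: show the startup environment of the current shell.
print_proc_env "$$"
```

An empty result for the nginx worker's PID would match the situation described in the report, where the job inherits nothing from its caller.
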