feat: run accelerated unattended-upgrade at node creation time #4217

jackfrancis · 2021-02-02T22:02:29Z

Reason for Change:

This PR adds a runUnattendedUpgradesOnBootstrap option to the linuxProfile API model configuration, so that users can explicitly accelerate the acceptance of new downstream packages on node VMs when bringing them online.

In practice this will slow down node creation, and it will require extra post-installation validation, because any installed packages that were not already present on the AKS Engine-curated VHD will not have been tested (this assumes you're using one of those VHDs).

Fixes #4156


jackfrancis · 2021-02-02T22:08:41Z

parts/k8s/cloud-init/artifacts/cse_main.sh

@@ -276,6 +276,10 @@ if [[ $OS == $UBUNTU_OS_NAME ]]; then
 fi
 {{end}}

+{{- if RunUnattendedUpgrades}}
+apt_get_update && apt_get_dist_upgrade && unattended_upgrade


In practice (I think) the unattended_upgrade invocation here is superfluous (update and dist-upgrade will effectively do the deed); I'm including it here to be extra explicit.

perhaps @Michael-Sinz can confirm if this is sane

Mainly I trust our apt_get_update and apt_get_dist_upgrade functions to definitively accomplish those tasks over silently calling /usr/bin/unattended-upgrade. The latter (by design) silently fails single invocations (because it knows it'll be invoked again — it's not in a rush) if, for example, various apt locks are held (there are probably other reasons).
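The silent-failure behavior described above is the usual reason for wrapping apt operations in wait-and-retry logic. Here is a minimal sketch of that idea. The helper names `apt_locks_held` and `retry_cmd` are hypothetical and are not the actual `apt_get_update` or `apt_get_dist_upgrade` implementations from AKS Engine; the lock file paths are the usual Ubuntu ones and are an assumption, as is the availability of `fuser`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, for illustration only.

# Return 0 if some process holds one of the common apt/dpkg lock files.
# Assumes `fuser` (psmisc) is installed; if it is missing, this reports
# "not held", so a real script should check for the tool first.
apt_locks_held() {
  fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock \
        /var/cache/apt/archives/lock >/dev/null 2>&1
}

# retry_cmd <attempts> <sleep_seconds> <command...>
# Run the command until it succeeds or the attempts are exhausted.
retry_cmd() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Only sleep if another attempt is coming.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Hypothetical usage:
#   while apt_locks_held; do sleep 3; done
#   retry_cmd 10 5 apt-get update
```

Unlike a single silent invocation, the wrapper returns a non-zero status once the retries are exhausted, so the caller can fail provisioning explicitly.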

The big difference between unattended-upgrades and apt-get dist-upgrade is the list of things it will install.

unattended-upgrade is constrained to the list of updates that are deemed safe and vital for security/reliability. These do not include minor feature updates unless one was required for security. (This is the default and recommended configuration for unattended-upgrade.)

For example, on a test VM, I just logged in and noticed this right now:

58 packages can be updated. 4 updates are security updates.

After running unattended-upgrades on that machine (which cron normally does for me on a regular basis), the login looks like this:

54 packages can be updated. 0 updates are security updates.
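To compare those before and after counts from a script rather than from the login banner, something like the following could be used. It is a sketch under assumptions: on Ubuntu hosts with `update-notifier-common` installed, `/usr/lib/update-notifier/apt-check` reports the counts in the form `<updates>;<security updates>` on stderr. The function name `parse_apt_check` is hypothetical.

```shell
#!/usr/bin/env bash
# parse_apt_check "<updates>;<security>"
# Print a one-line summary in the same wording as the login banner.
parse_apt_check() {
  local counts="$1"
  local total="${counts%%;*}"     # text before the first ';'
  local security="${counts##*;}"  # text after the last ';'
  echo "${total} packages can be updated. ${security} updates are security updates."
}

# Hypothetical usage on an Ubuntu host (the counts arrive on stderr):
#   parse_apt_check "$(/usr/lib/update-notifier/apt-check 2>&1)"
```

Running this before and after `unattended-upgrade` should show the security count drop while most of the remaining updates stay pending, which is the difference from a full upgrade described here.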

This is very different from a full apt-get update/apt-get upgrade (which itself is less than apt-get dist-upgrade)

The actual ubuntu unattended-upgrade command will return an error if it fails to complete an update. But it is constrained to the security updates.

Another good thing about unattended-upgrades is that it sets the unattended settings for apt/apt-get/dpkg so that it should not hang (packages can still cause this problem, but that is rare in the security patches).

Which to use is really a question of balancing all of the risks.

We run unattended-upgrade on a regular basis because we can trust it at scale.

PS - It is redundant to run unattended-upgrade after having done the full upgrade or dist-upgrade.

It may be useful to do unattended-upgrade first just to be sure they complete before getting into the larger set (both from a security standpoint and an ability to complete them)

So I would not run unattended afterwards.

This all makes sense. What's perplexing is that, in practice, simply adding a "wait for apt locks and then run unattended-upgrade" step during CSE does not, in my tests, produce the expected /var/run/reboot-required file (a symptom of critical security updates arriving).

I'm going to try apt-get update && unattended-upgrade next.
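The check being used as a success signal here can be sketched as a small helper. The function name `reboot_required` is hypothetical; the default path is the one named in the comment above, and the optional argument exists only so the function can be exercised without touching the real file.

```shell
#!/usr/bin/env bash
# reboot_required [flag_file]
# Return 0 if the reboot flag file exists, non-zero otherwise.
reboot_required() {
  [ -f "${1:-/var/run/reboot-required}" ]
}

# Hypothetical usage after an upgrade run:
#   apt-get update && unattended-upgrade
#   if reboot_required; then
#     echo "updates were installed that need a reboot"
#   fi
```

The absence of the file does not by itself prove that no updates were installed; it only means nothing that was installed asked for a reboot.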

codecov · 2021-02-02T22:21:12Z

Codecov Report

Merging #4217 (94b7c6a) into master (805416e) will increase coverage by 0.00%.
The diff coverage is 83.33%.

@@           Coverage Diff           @@
##           master    #4217   +/-   ##
=======================================
  Coverage   73.36%   73.36%           
=======================================
  Files         135      135           
  Lines       20849    20855    +6     
=======================================
+ Hits        15296    15301    +5     
- Misses       4576     4577    +1     
  Partials      977      977

| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/api/types.go | `92.72% <ø> (ø)` | |
| pkg/api/vlabs/types.go | `73.04% <ø> (ø)` | |
| pkg/engine/templates_generated.go | `44.19% <ø> (ø)` | |
| pkg/engine/template_generator.go | `68.34% <75.00%> (+0.04%)` | ⬆️ |
| pkg/api/converterfromapi.go | `95.68% <100.00%> (+<0.01%)` | ⬆️ |
| pkg/api/convertertoapi.go | `94.04% <100.00%> (+0.01%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 805416e...f9ba264.

jackfrancis · 2021-02-03T19:50:32Z

parts/k8s/cloud-init/artifacts/cse_main.sh

@@ -276,6 +276,10 @@ if [[ $OS == $UBUNTU_OS_NAME ]]; then
 fi
 {{end}}

+{{- if RunUnattendedUpgrades}}
+apt_get_update && unattended_upgrade


My tests so far show that the above works: when security updates are available, running apt-get update and then unattended-upgrade, serially, picks them up. So we can trust that the "runUnattendedUpgradesOnBootstrap" feature does the right thing and actually applies the OS updates (i.e., reboots) during cluster creation.

In the past I saw this not always work, but that could have been a timing issue related to when other things are set up by cloud-init. This is likely a better place to do it.

Is there a reason that this would not be the default behavior?

The primary reason is the judgment that having a node reboot before first coming online introduces (1) an undesirable delay and (2) a demonstrable loss in node bootstrap reliability.

I don't think we can avoid (1): if nodes start with a stale OS security package configuration and are brought up to date even when that requires a reboot, it will definitely take longer, most of the time, for them to come online. That is always going to drag up the average node bootstrap time.

I wonder about (2), though. Can we summarize the additional risk of scooping up untested packages, plus any additional risk that a VM OS won't successfully come back online?

The risk is relatively low but is not zero. We have not had an outage due to the security updates as they are vetted relatively well. The question is how bad is it to run a node without the security updates?

I am not saying someone could not opt out, but it is a question of which way we should be "safe by default" and what "safe" means.

I propose we start here and perhaps change the default after some more testing.

/lgtm

acs-bot · 2021-02-03T23:26:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, Michael-Sinz


Needs approval from an approver in each of these files:

~~OWNERS~~ [jackfrancis]


jackfrancis added 2 commits February 2, 2021 13:58

feat: run accelerated unattended-upgrade at node creation time

e4d02cf

typo

508dd44


skip dist-upgrade

94b7c6a


correct docs

f9ba264

jackfrancis merged commit 8fe60fb into Azure:master Feb 3, 2021

jackfrancis deleted the cse-unattended-upgrade branch February 3, 2021 23:29

fmotrifork mentioned this pull request Feb 4, 2021

aks-engine 0.60.0 fishworks/fish-food#1203

Merged

jackfrancis mentioned this pull request Feb 4, 2021

feat: run unattended upgrades by default #4231

Merged
