1.21.0 Status 500 when PATCHing /settings?tx=bottlerocket-launch #4135

Closed
EthanKane-FD opened this issue Aug 9, 2024 · 17 comments
Labels
status/needs-triage Pending triage or re-evaluation type/bug Something isn't working

Comments

@EthanKane-FD

EthanKane-FD commented Aug 9, 2024

Hey there, we noticed an issue today with the latest version of Bottlerocket. Our new builds picked up the latest version and our nodes are failing to boot. Any help would be greatly appreciated.
Image I'm using:
Bottlerocket OS 1.21.0

What I expected to happen:

         Starting Bottlerocket userdata configuration system...

[  OK  ] Finished Bottlerocket userdata configuration system.

What actually happened:
The Bottlerocket AMI updated last night to Bottlerocket OS 1.21.0 (aws-k8s-1.30), and the Bottlerocket userdata configuration is now failing.

We are seeing the following in the system logs:

         Starting Bottlerocket userdata configuration system...

[    3.428743] early-boot-config[1329]: Error PATCHing '/settings?tx=bottlerocket-launch': Status 500 when PATCHing /settings?tx=bottlerocket-launch: Error serializing Settings: 'unit' not allowed by Serializer
[FAILED] Failed to start Bottlerocket userdata configuration system.

See 'systemctl status early-boot-config.service' for details.

[DEPEND] Dependency failed for Bottlerocket initial configuration complete.

[DEPEND] Dependency failed for Isolates configured.target.

[DEPEND] Dependency failed for Applies settings to create config files.

[DEPEND] Dependency failed for Send signal to CloudFormation Stack.

[DEPEND] Dependency failed for Sets the hostname.

[DEPEND] Dependency failed for User-specified setting generators.

[DEPEND] Dependency failed for Generate additional settings for Kubernetes.

How to reproduce the problem:

Upgrade from 1.20.5 to 1.21.0.

@EthanKane-FD EthanKane-FD added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Aug 9, 2024
@patkinson01

We've just seen issues on some of our clusters trying to update to 1.21.0 too - seems a similar issue so pasting here - but if not let me know and I'll raise a separate ticket:

    Starting Generate additional settings for Kubernetes...

[ 7.882539] pluto[1498]: Unable to retrieve cluster name and AWS region from Bottlerocket API: Deserialization of configuration file failed: invalid type: sequence, expected a string at line 16 column 18
[FAILED] Failed to start Generate additional settings for Kubernetes.

See 'systemctl status pluto.service' for details.

[DEPEND] Dependency failed for Applies settings to create config files.

[DEPEND] Dependency failed for Sets the hostname.

[DEPEND] Dependency failed for Send signal to CloudFormation Stack.

[DEPEND] Dependency failed for Bottlerocket initial configuration complete.

[DEPEND] Dependency failed for Isolates configured.target.

@ramseymcgrathfd

ramseymcgrathfd commented Aug 9, 2024

Example launch template to reproduce

"image-gc-high-threshold-percent" = "${config.image_gc_high_threshold_percent}"
"image-gc-low-threshold-percent"  = "${config.image_gc_low_threshold_percent}"
"eviction-max-pod-grace-period"   = "${config.max_pod_grace_period}"

[settings.kubernetes.node-labels]
%{ for label_key, label_value in config.labels }
"${label_key}" = "${label_value}"
%{ endfor ~}

[settings.kubernetes.node-taints]
%{ for taint_key, taint_value in config.taints }
"${taint_key}" = "${taint_value}"
%{ endfor ~}

[settings.kubernetes.credential-providers.ecr-credential-provider]
enabled = true
cache-duration = "30m"
image-patterns = [
  "*.dkr.ecr.*.amazonaws.com"
]

[settings.kubernetes.eviction-hard]
%{ for key, value in config.eviction_hard_values }
"${key}" = "${value}"
%{ endfor ~}

[settings.kubernetes.eviction-soft]
%{ for key, value in config.eviction_soft_values }
"${key}" = "${value}"
%{ endfor ~}

[settings.kubernetes.eviction-soft-grace-period]
%{ for key, value in config.soft_grace_period_values }
"${key}" = "${value}"
%{ endfor ~}

[settings.kubernetes.system-reserved]
cpu = "${config.system_reserved_cpu}"
memory = "${config.system_reserved_memory}"
ephemeral-storage = "${config.system_reserved_ephemeral}"

[settings.metrics]
# whether or not health metrics will be sent. set to false to opt-out
send-metrics = false

# Use local aws time server
[settings.ntp]
time-servers = ["169.254.169.123"]

# The admin host container provides SSH access and runs with "superpowers".
# It is disabled by default, but can be disabled explicitly.
[settings.host-containers.admin]
enabled = false

# The control host container provides out-of-band access via SSM.
# It is enabled by default, and can be disabled if you do not
# expect to use SSM. This could leave you with no way to access
# the API and change settings on an existing node!
[settings.host-containers.control]
enabled = true

@yeazelm
Contributor

yeazelm commented Aug 9, 2024

Thank you @EthanKane-FD, @ramseymcgrathfd, and @patkinson01 for reporting this! We are looking at this now and will provide an update as soon as possible.

@yeazelm
Contributor

yeazelm commented Aug 9, 2024

For folks who have seen this issue: if you can include the userdata needed to reproduce it, similar to what @ramseymcgrathfd posted, that would help a ton. If you don't want to post it on GitHub, you can open an AWS Support case and provide it there; that would help too.

@EthanKane-FD
Author

Hey @yeazelm, thanks for checking. @ramseymcgrathfd and I are on the same team, so that's our user data config.

@patkinson01

patkinson01 commented Aug 9, 2024

Hi @yeazelm, please find below our userdata:

[settings.network]
no-proxy = ${no_proxy}
https-proxy = "${http_proxy}" # Squid Proxy with access to only specific approved domains

[[settings.container-registry.credentials]]
registry = "${repo_url}" # Internal repo where we pull all images from (except for some managed addons which need to come from AWS ECR repos)
username = "${repo_username}"
password = "${repo_api_key}"

[settings.kernel.sysctl]
"user.max_user_namespaces" = "0"
"vm.max_map_count" = "262144"
"net.ipv4.conf.all.send_redirects" = "0" #cis hardening 3.1.1
"net.ipv4.conf.default.send_redirects" = "0" #cis hardening 3.1.1
"net.ipv4.conf.all.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv4.conf.default.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv6.conf.all.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv6.conf.default.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv4.conf.all.secure_redirects" = "0" #cis hardening 3.2.3
"net.ipv4.conf.default.secure_redirects" = "0" #cis hardening 3.2.3
"net.ipv4.conf.all.log_martians" = "1" #cis hardening 3.2.4
"net.ipv4.conf.default.log_martians" = "1" #cis hardening 3.2.4

[settings.kubernetes.node-labels]
"bottlerocket.aws/updater-interface-version" = "2.0.0" # Configure the node-labels Bottlerocket setting to enable BruPop updates

[settings.bootstrap-containers.bottle]
source = "${repo_url}/${bottle_rocket_repo_name}/${bottle_rocket_image_name}:${bottle_rocket_image_version}"
mode = "once"
user-data = "${user_data}" #base64 encoded set of values used in our bottlerocket bootstrap image to configure Vault access and proxy
essential = true

[settings.updates]
ignore-waves = ${bottle_rocket_update_immediately}
seed = ${bottle_rocket_seed}

[settings.kubernetes]
api-server = "${cluster_endpoint}"
cluster-certificate = "${cluster_ca_base64}"
cluster-name = "${eks_cluster_id}"

@ytsssun
Contributor

ytsssun commented Aug 9, 2024

@ramseymcgrathfd Do you by any chance have the rendered userdata? I tried applying some values to the template but could not reproduce the failure. Here is my userdata:

[settings.kubernetes]
"image-gc-high-threshold-percent" = 90
"image-gc-low-threshold-percent"  = 80
"eviction-max-pod-grace-period"   = 40

[settings.kubernetes.node-labels]
"name" = "my-node"


[settings.kubernetes.node-taints]
special = ["true:NoSchedule"]

[settings.kubernetes.credential-providers.ecr-credential-provider]
enabled = true
cache-duration = "30m"
image-patterns = [
  "*.dkr.ecr.*.amazonaws.com"
]

[settings.kubernetes.eviction-hard]
"memory.available" = "15%"

[settings.kubernetes.eviction-soft]
"memory.available" = "12%"

[settings.kubernetes.eviction-soft-grace-period]
"memory.available" = "30s"

[settings.kubernetes.system-reserved]
cpu = "10m"
ephemeral-storage = "1Gi"
memory = "100Mi"

[settings.metrics]
# whether or not health metrics will be sent. set to false to opt-out
send-metrics = false

# Use local aws time server
[settings.ntp]
time-servers = ["169.254.169.123"]

# The admin host container provides SSH access and runs with "superpowers".
# It is disabled by default, but can be disabled explicitly.
[settings.host-containers.admin]
enabled = false

# The control host container provides out-of-band access via SSM.
# It is enabled by default, and can be disabled if you do not
# expect to use SSM. This could leave you with no way to access
# the API and change settings on an existing node!
[settings.host-containers.control]
enabled = true

I was able to upgrade from v1.20.0 to v1.21.0, starting from the variant bottlerocket-aws-k8s-1.30-x86_64-v1.20.0.

[ssm-user@control]$ apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "4d43022e",
    "pretty_name": "Bottlerocket OS 1.21.0 (aws-k8s-1.30)",
    "variant_id": "aws-k8s-1.30",
    "version_id": "1.21.0"
  }
}

@ytsssun
Contributor

ytsssun commented Aug 9, 2024

I was able to reproduce the issue mentioned in #4135 (comment).

My userdata

[settings.network]
no-proxy = ["localhost", "127.0.0.1"]

[settings.kernel.sysctl]
"user.max_user_namespaces" = "0"
"vm.max_map_count" = "262144"
"net.ipv4.conf.all.send_redirects" = "0" #cis hardening 3.1.1
"net.ipv4.conf.default.send_redirects" = "0" #cis hardening 3.1.1
"net.ipv4.conf.all.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv4.conf.default.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv6.conf.all.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv6.conf.default.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv4.conf.all.secure_redirects" = "0" #cis hardening 3.2.3
"net.ipv4.conf.default.secure_redirects" = "0" #cis hardening 3.2.3
"net.ipv4.conf.all.log_martians" = "1" #cis hardening 3.2.4
"net.ipv4.conf.default.log_martians" = "1" #cis hardening 3.2.4

[settings.kubernetes.node-labels]
"bottlerocket.aws/updater-interface-version" = "2.0.0" # Configure the node-labels Bottlerocket setting to enable BruPop updates

[settings.updates]
ignore-waves = true

The failure

[    3.741549] pluto[1484]: Unable to retrieve cluster name and AWS region from Bottlerocket API: Deserialization of configuration file failed: invalid type: sequence, expected a string at line 15 column 18
[FAILED] Failed to start Generate additional settings for Kubernetes.

@bcressey
Contributor

bcressey commented Aug 9, 2024

[ 7.882539] pluto[1498]: Unable to retrieve cluster name and AWS region from Bottlerocket API: Deserialization of configuration file failed: invalid type: sequence, expected a string at line 16 column 18

This is happening because pluto only expects a String for no-proxy, when it should take a list.
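
To make the type mismatch concrete, here is a minimal std-only Rust sketch. It is not pluto's actual code: `Value`, `strict_no_proxy`, and `tolerant_no_proxy` are hypothetical names, and the toy `Value` enum only stands in for a parsed configuration value. Treating a lone string as a one-element list is also an assumption made for illustration.

```rust
/// Toy stand-in for a parsed configuration value (hypothetical, not a real TOML type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Seq(Vec<Value>),
}

/// Mirrors the failing behavior: only a plain string is accepted,
/// so a list such as ["localhost", "127.0.0.1"] is rejected.
fn strict_no_proxy(value: &Value) -> Result<String, String> {
    match value {
        Value::Str(s) => Ok(s.clone()),
        Value::Seq(_) => Err("invalid type: sequence, expected a string".to_string()),
    }
}

/// Accepts a list of strings, and (as an assumption) treats a lone
/// string as a one-element list.
fn tolerant_no_proxy(value: &Value) -> Result<Vec<String>, String> {
    match value {
        Value::Str(s) => Ok(vec![s.clone()]),
        Value::Seq(items) => items
            .iter()
            .map(|item| match item {
                Value::Str(s) => Ok(s.clone()),
                Value::Seq(_) => Err("invalid type: nested sequence, expected a string".to_string()),
            })
            .collect(),
    }
}
```

With a value built from `["localhost", "127.0.0.1"]`, `strict_no_proxy` returns an error shaped like the one in the pluto log, while `tolerant_no_proxy` returns both hosts.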

@patkinson01

patkinson01 commented Aug 9, 2024

[ 7.882539] pluto[1498]: Unable to retrieve cluster name and AWS region from Bottlerocket API: Deserialization of configuration file failed: invalid type: sequence, expected a string at line 16 column 18

This is happening because pluto only expects a String for no-proxy, when it should take a list.

Hi @bcressey, we've seen the error during a Brupop-initiated update and haven't made any changes to our userdata or to our no_proxy value, which is a string. Presumably this is something that has changed in this latest AMI then?

@bcressey
Contributor

bcressey commented Aug 9, 2024

Hi @bcressey, we've seen the error during a Brupop-initiated update and haven't made any changes to our userdata or to our no_proxy value, which is a string. Presumably this is something that has changed in this latest AMI then?

The bug is in the newer version of pluto in 1.21.0. If you have settings.network.no-proxy defined in your settings (it's not defined by default) then it would trigger this issue on upgrade. If you don't have that setting defined then there may be another pluto bug.

@bcressey
Contributor

bcressey commented Aug 9, 2024

[    3.428743] early-boot-config[1329]: Error PATCHing '/settings?tx=bottlerocket-launch': Status 500 when PATCHing /settings?tx=bottlerocket-launch: Error serializing Settings: 'unit' not allowed by Serializer

@sam-berning tracked this down to an issue with optional fields in the CredentialProvider structure. Omitting a field marked as optional will cause it to serialize to "null" which is then rejected by the datastore serializer.

bash-5.1# cat <<EOF > /local/user-data-defaults.toml
> [settings.kubernetes.credential-providers.ecr-credential-provider]
> enabled = true
> cache-duration = "30m"
> image-patterns = [
>   "*.dkr.ecr.*.amazonaws.com"
> ]
> EOF

bash-5.1# early-boot-config
[2024-08-09T17:52:21Z INFO  early_boot_config] early-boot-config started
[2024-08-09T17:52:21Z INFO  early_boot_config] Gathering user data providers
[2024-08-09T17:52:21Z INFO  early_boot_config] Provider '10-local-defaults': [2024-08-09T17:52:21Z INFO  early_boot_config_provider::provider] '/local/user-data-defaults.toml' exists, using it
[2024-08-09T17:52:21Z INFO  early_boot_config] Found user data via user data from /local/user-data-defaults.toml, sending to API
Error PATCHing '/settings?tx=bottlerocket-launch': Status 500 when PATCHing /settings?tx=bottlerocket-launch: Error serializing Settings: 'unit' not allowed by Serializer

Fully specifying the user data for the credential provider, by passing in a no-op environment variable, would avoid the issue:

[settings.kubernetes.credential-providers.ecr-credential-provider]
enabled = true
cache-duration = "30m"
image-patterns = [
  "*.dkr.ecr.*.amazonaws.com"
]
environment.foo = "bar"

@ramseymcgrathfd

ramseymcgrathfd commented Aug 13, 2024

@bcressey yeah good catch, it does

reckon it'll need

    #[serde(skip_serializing_if = "Option::is_none")] 

@sam-berning
Contributor

reckon it'll need

    #[serde(skip_serializing_if = "Option::is_none")] 

Yup, that's indeed the right fix. It should be addressed as of bottlerocket-os/bottlerocket-settings-sdk#51. We've also updated the datastore serializer to handle null values correctly in bottlerocket-os/bottlerocket-core-kit#80, which should protect against this sort of bug moving forward.
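
As a rough illustration of why skipping `None` fields avoids the error, here is a std-only Rust sketch. It does not use serde and is not the Bottlerocket code: `Value`, `write_to_datastore`, and this simplified `CredentialProvider` (with `environment` reduced to a single optional string) are hypothetical stand-ins. In real serde code, the `skip_serializing_if = "Option::is_none"` attribute quoted above has the effect that `fields_skipping_none` imitates here.

```rust
use std::collections::BTreeMap;

/// Toy serialized value; `Unit` plays the role of "null".
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Unit,
    Bool(bool),
    Str(String),
}

/// Toy strict writer: like the serializer in the error message, it refuses unit values.
fn write_to_datastore(fields: &[(&str, Value)]) -> Result<BTreeMap<String, String>, String> {
    let mut out = BTreeMap::new();
    for (key, value) in fields {
        let rendered = match value {
            Value::Unit => return Err(format!("'unit' not allowed by Serializer (key: {})", key)),
            Value::Bool(b) => b.to_string(),
            Value::Str(s) => format!("\"{}\"", s),
        };
        out.insert(key.to_string(), rendered);
    }
    Ok(out)
}

/// Simplified, hypothetical model of a credential provider with optional fields.
struct CredentialProvider {
    enabled: bool,
    cache_duration: Option<String>,
    environment: Option<String>,
}

/// Maps an optional string to a value, turning `None` into `Unit`.
fn opt(field: &Option<String>) -> Value {
    match field {
        Some(s) => Value::Str(s.clone()),
        None => Value::Unit,
    }
}

impl CredentialProvider {
    /// Emits every field, so an omitted optional field becomes `Unit`.
    fn fields_with_nulls(&self) -> Vec<(&'static str, Value)> {
        vec![
            ("enabled", Value::Bool(self.enabled)),
            ("cache-duration", opt(&self.cache_duration)),
            ("environment", opt(&self.environment)),
        ]
    }

    /// Drops omitted optional fields entirely instead of emitting `Unit`.
    fn fields_skipping_none(&self) -> Vec<(&'static str, Value)> {
        self.fields_with_nulls()
            .into_iter()
            .filter(|(_, value)| *value != Value::Unit)
            .collect()
    }
}
```

For a provider with `environment: None`, writing `fields_with_nulls()` fails with the unit error, while writing `fields_skipping_none()` succeeds and simply has no `environment` key.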

@yeazelm
Contributor

yeazelm commented Aug 27, 2024

We have released 1.21.1, which should allow a clean upgrade from 1.20.5. Please let us know whether it solves your problem!

@patkinson01

All good, thanks for a quick turnaround!!

@EthanKane-FD
Author

Hey thanks @yeazelm , have rolled this out on a few lab clusters and everything seems to be in order. Thanks again
