Kourier crashes on Raspberry Pi k3s cluster with "Out of memory trying to allocate internal tcmalloc data" #971

Closed
GJKrupa opened this issue Feb 19, 2022 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@GJKrupa

GJKrupa commented Feb 19, 2022

Describe the bug
Kourier gateway fails to start on Raspberry Pi 4 k3s cluster. The gateway pod crashes with the following error:

external/com_github_google_tcmalloc/tcmalloc/system-alloc.cc:550] MmapAligned() failed (size, alignment) 1073741824 1073741824 @ 0x558c6fe390 0x558c6f0934 0x558c6f0374 0x558c6d9db4 0x558c6ed5a4
external/com_github_google_tcmalloc/tcmalloc/arena.cc:34] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size) 131072 48 @ 0x558c6fe6a0 0x558c6d9e30 0x558c6ed5a4

Expected behavior
Knative should start up with a working Kourier gateway.

To Reproduce

  1. Create a Kubernetes cluster on Raspberry Pi OS (64-bit lite version) using k3s v1.22.6+k3s1
  2. Install the latest operator (as of 19 Feb 2022) by running kubectl apply -f https://github.com/knative/operator/releases/latest/download/operator.yaml
  3. Apply the following Kubernetes YAML:
apiVersion: v1
kind: Namespace
metadata:
  name: knative-serving
---
apiVersion: operator.knative.dev/v1alpha1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  ingress:
    kourier:
      enabled: true
  config:
    network:
      domainTemplate: "{{.Name}}.{{.Domain}}"
      ingress-class: "kourier.ingress.networking.knative.dev"
    domain:
      example.com: ""

Knative release version
v1.2.0

Additional context
This cluster was running fine previously with Knative 0.25.1 and Knative Kourier v0.25.0. The cause appears to be that the Envoy proxy is built with a version of tcmalloc that doesn't support ARM64. See google/tcmalloc#33 for details.

My cluster is made up of four Raspberry Pi 4s with 8GB RAM each. Prometheus monitoring on the cluster shows that no node is using more than 29% of its available RAM.

GJKrupa added the kind/bug label Feb 19, 2022
@GJKrupa
Author

GJKrupa commented Feb 19, 2022

I've also tried using Contour, and the same thing happens with the contour-internal and contour-external Envoy pods from v1.2.0.

@houshengbo
Contributor

@nak3 @ZhiminXiang Do you folks know anything about this?

@nak3
Contributor

nak3 commented Feb 23, 2022

"Envoy proxy built with a version of tcmalloc that doesn't support ARM64."

@GJKrupa I agree that it is an issue with Envoy on ARM64. I would reach out to Envoy upstream (as you have already done). I don't think it is a Knative issue, so there is nothing we can do about it.

@GJKrupa
Author

GJKrupa commented Feb 23, 2022

Absolutely, it's an Envoy issue, but would it be possible to add an override that allows a specific Envoy image tag to be used, in the same way that the ConfigMaps can be modified? Reverting to Envoy 1.16 is an effective workaround for the problem, since that version works on arm64.

@github-actions

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label May 25, 2022
@GJKrupa
Author

GJKrupa commented Jun 2, 2022

/remove-lifecycle stale

knative-prow bot removed the lifecycle/stale label Jun 2, 2022
@nak3
Contributor

nak3 commented Jul 28, 2022

Hi @GJKrupa, sorry for the delay. You can replace the Envoy image with spec.registry.override like this:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    network:
      ingress-class: kourier.ingress.networking.knative.dev
  ingress:
    kourier:
      enabled: true
  registry:
    override:
      kourier-gateway: docker.io/envoyproxy/envoy:v1.16-latest

However, Envoy 1.16 is already EOL [1], and I don't recommend that you keep using the old Envoy version. I believe it is better to fix the Envoy issue than to rely on this workaround.
[1] https://github.com/envoyproxy/envoy/blob/main/RELEASES.md#major-release-schedule

@GJKrupa
Author

GJKrupa commented Jul 28, 2022

Thanks. This is exactly what I was looking for. I'll use this same config option to switch to 1.23 once they release a fix for a different ARM64 issue: envoyproxy/envoy#22384
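
For reference, a minimal sketch of what that later switch might look like, reusing the manifest from the previous comment with only the image changed. The v1.23-latest tag is an assumption made for illustration: it is not confirmed anywhere in this thread, so check that the tag exists and includes the fix for envoyproxy/envoy#22384 before applying it.

```yaml
# Sketch only: same spec.registry.override as in the previous comment,
# pointed at a newer Envoy image.
# ASSUMPTION: the tag "v1.23-latest" is illustrative and unverified here.
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    network:
      ingress-class: kourier.ingress.networking.knative.dev
  ingress:
    kourier:
      enabled: true
  registry:
    override:
      # Key and registry path are copied from the override above; only the tag differs.
      kourier-gateway: docker.io/envoyproxy/envoy:v1.23-latest
```

Pinning an exact patch-release tag instead of a floating one would make the gateway image reproducible across pod restarts, at the cost of having to bump it by hand.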

GJKrupa closed this as completed Jul 28, 2022