
queue-proxy is huge #9957

Open
mattmoor opened this issue Oct 27, 2020 · 19 comments
Labels
area/API, area/autoscale, area/networking, kind/bug, lifecycle/frozen, triage/accepted
Comments

@mattmoor
Member

/area API
/area autoscale
/area networking

What version of Knative?

HEAD

Expected Behavior

queue-proxy is trivially small.

Actual Behavior

As of this morning, queue-proxy is ~55.7MB.

Steps to Reproduce the Problem

GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build ./cmd/queue
ls -l queue
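
If you want to see where the bytes go (a sketch, assuming a local Go toolchain and an unstripped binary), the symbol table gives a rough breakdown of the largest contributors:

go tool nm -size -sort size queue | head -n 30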

Going to self-assign, but feel free to jump in and party on this with me.
/assign

@mattmoor
Member Author

Goal: eliminate queue-proxy dependency on k8s.io/client-go/kubernetes

Two offenders:

  1. k8s.io/client-go/informers is pulled in via the informed watcher in knative.dev/pkg/configmap (unused in queue-proxy): Splitting configmap package to avoid heavy dependencies pkg#1851
  2. k8s.io/client-go/kubernetes is pulled in directly via knative.dev/pkg/metrics for the stackdriver exporter to fetch secrets.

Pulling in the above PR and commenting out the kubeClient references in knative.dev/pkg/metrics yields a binary size of 42,193,344 (a 32% reduction in size).
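
A quick way to confirm a cut like this actually removed the package from the dependency graph (run from the serving repo root):

go list -deps ./cmd/queue | grep k8s.io/client-go/kubernetes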

@mattmoor
Member Author

Goal: eliminate queue-proxy dependency on contrib.go.opencensus.io/exporter/stackdriver

Two offenders:

  1. knative.dev/pkg/metrics
  2. knative.dev/pkg/tracing

Commenting out the stackdriver logic in these packages further reduces the queue-proxy binary size to 35,560,653 (another 18% reduction).

cc @evankanderson
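
The same dependency-graph check as above works for the exporter path, e.g.:

go list -deps ./cmd/queue | grep contrib.go.opencensus.io/exporter/stackdriver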

@vagababov
Contributor

In my build, opencensus also seems to pull in 793066 T github.com/aws/aws-sdk-go/aws/endpoints.init (symbol size in bytes), which with other AWS stuff is another meg.

@mattmoor
Member Author

FWIW, that's likely included in my figures, unless it is transitively pulled in through other things. I'm measuring the overall binary size once a particular cut point in the dependency graph has been snipped.

@mattmoor
Member Author

mattmoor commented Oct 27, 2020

Shouldn't affect the code pages, but @tcnghia also found that if we use:

GOFLAGS=-ldflags="-s -w"

It drops things another ~10MB (on top of what I measured above). Seems like the lowest hanging fruit yet.

https://blog.filippo.io/shrink-your-go-binaries-with-this-one-weird-trick/
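
Concretely, that's the reproduction step from above with the linker flags added (-s omits the symbol table, -w omits DWARF debug info):

GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -ldflags="-s -w" ./cmd/queue
ls -l queue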

@mattmoor
Member Author

@evankanderson do you want to link to or summarize your thoughts on timeline to cut this dependency?

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale label Jan 29, 2021
@Cynocracy
Contributor

/remove-lifecycle stale

@knative-prow-robot knative-prow-robot removed the lifecycle/stale label Jan 29, 2021
@Cynocracy
Contributor

This reminded me of reading https://www.cockroachlabs.com/blog/go-file-size/

I don't know how up to date it is, but maybe it would be reasonable to consider another language if we wanted to minimize the binary?

For the stackdriver dependency, we're in the (somewhat slow) process of rolling out a queue-proxy that uses OTel to our fleet.
We should be OK to delete the support, at least in queue-proxy. I don't think we've migrated knative-gcp over to OTel yet, though, so maybe it could be done using build tags?

I believe we also need to publish some docs on how to configure OTel in place of the old stackdriver support; I can try to get that done soon.

@mattmoor
Member Author

/lifecycle frozen

Yeah, I poked through that when I first opened this.

maybe it would be reasonable to consider another language if we wanted to minimize the binary?

While I think minimizing the binary is a nice goal, I also think there are practical considerations. I'm reminded of Kubernetes debating rewriting the pause container in assembly; the practical implications of maintaining that vs. something slightly larger in a higher-level language weren't worth it.

Likewise here, I don't think the goal is to have an O(KB) QP, but it'd be nice if we could keep it away from O(100MB) 😉

maybe it could be done using build tags

I already added nostackdriver to enable us to check that we don't introduce these deps in new ways (other than metrics/tracing), but nothing builds with that currently. I'd love to just rip that out when y'all are ready 🤩
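
For anyone following along, the tag works like any other Go build constraint: files that import the stackdriver packages would carry a !nostackdriver constraint, and a check of the tagged build could look something like the commands below (a sketch; per the above, nothing is wired up to run this today). The second command shows what, if anything, still pulls a stackdriver package in under the tag:

go build -tags nostackdriver ./cmd/queue
go list -deps -tags nostackdriver ./cmd/queue | grep stackdriver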

@knative-prow-robot knative-prow-robot added the lifecycle/frozen label Jan 29, 2021
@Cynocracy
Contributor

Gotcha 😃 That all makes sense to me!

While I think minimizing the binary is a nice goal, I also think there are practical considerations. I'm reminded of Kubernetes debating rewriting the pause container in assembly; the practical implications of maintaining that vs. something slightly larger in a higher-level language weren't worth it.
Likewise here, I don't think the goal is to have an O(KB) QP, but it'd be nice if we could keep it away from O(100MB) 😉

Totally agree. After I wrote that I immediately thought: well, what language, then? 🚲 Maaaybe something like Rust, but that's a stretch for maintenance, though it does sound kind of fun as a proof of concept.

I already added nostackdriver to enable us to check that we don't introduce these deps in new ways (other than metrics/tracing), but nothing builds with that currently. I'd love to just rip that out when y'all are ready 🤩

Will try to keep this updated with our progress; we're close enough that I wouldn't mind seeing the stackdriver deletion in 0.21 or 0.22, wdyt? At this point I think it's fair for /us/ to take on the pain of porting that patch forward if we absolutely must 🤕

@mattmoor
Member Author

Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.

Have you met @julz 😉

wdyt?

Defer to @evankanderson who's been tracking the broader effort most closely.

@julz
Member

julz commented Jan 31, 2021

Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.

I mean Envoy is written in C++ and has survived ok so.. Rust may not be so bad (and it has worked well for linkerd)! I started a repo at http://github.com/julz/roo-proxy a while ago to experiment a bit (without much progress yet, tho!). One thing I'd really like to try is to see how much of QP we could reimplement as a wasm plug-in for Envoy; that way we could reuse all of the great work and optimisations and features in Envoy and (for people who already have mesh) avoid needing a second sidecar at all.

Anyway, totally ack that there's a maintainability trade-off here; it's just that this may be a case where it's warranted (empirically QP is very expensive for what it does), and Rust is pretty ok.

@markusthoemmes
Contributor

FWIW, just to throw my hat in here too (not the red one necessarily): I've been tinkering on a Rust-based queue-proxy in my Friday learning time as well, without a ton of progress either. Currently it serves mostly as a segue for me to actually learn the language, but the queue-proxy in itself is a rather complex piece of software by now, I must say 😂. It ain't trivially rewritten in a day or two.

It'd definitely be interesting to see the diff in resource consumption and performance.

Maintainability trade-off... definitely!

@evankanderson
Member

/assign

/triage accepted

It sounds like Google may be looking to sustain Stackdriver support for longer than expected... I'm going to try to switch pkg/metrics to a registration pattern and make stackdriver, prometheus, and opencensus sub-modules that can register themselves, which should allow Google to maintain the stackdriver component outside of pkg but still link it against all the binaries.
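
For illustration, a minimal sketch of what that registration pattern could look like (hypothetical names and wiring, not the current pkg/metrics API): the core package only holds a registry, each exporter sub-module registers itself from an init(), and a binary only pays for the backends it imports (typically via a blank import in main).

// Core package: no exporter-specific imports.
package metrics

import "fmt"

// Exporter and ExporterFactory stand in for whatever interface the real package exposes.
type Exporter interface{ Flush() }

type ExporterFactory func(cfg map[string]string) (Exporter, error)

var factories = map[string]ExporterFactory{}

// RegisterExporter is called by exporter sub-modules from their init().
func RegisterExporter(name string, f ExporterFactory) { factories[name] = f }

// NewExporter fails cleanly when a backend wasn't linked into the binary.
func NewExporter(name string, cfg map[string]string) (Exporter, error) {
	f, ok := factories[name]
	if !ok {
		return nil, fmt.Errorf("metrics backend %q not compiled into this binary", name)
	}
	return f(cfg)
}

// Hypothetical stackdriver sub-module: the only place the heavy exporter
// dependency would be imported; binaries that don't import this package never pull it in.
package stackdriver

import "knative.dev/pkg/metrics" // assumes the registry sketched above, not today's API

func init() {
	metrics.RegisterExporter("stackdriver", func(cfg map[string]string) (metrics.Exporter, error) {
		// construct and return the real stackdriver exporter here
		return nil, nil
	})
}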

@dprotaso dprotaso modified the milestones: Ice Box, 0.25.x Jul 15, 2021
@dprotaso
Member

We've dropped the stackdriver exporters here: knative/pkg#2173

This has dropped the queue proxy size (on my Mac) from 50MB to 30MB - so a diff of 20MB (~40%).

There might be some additional gains to be made; I'll poke at this a bit more.

@dprotaso
Member

dprotaso commented Jul 16, 2021

So there are still gains to be made by dropping k8s.io/api/core/v1 and k8s.io/api/apps/v1, but it requires a lot of reshuffling of packages and types.

Specifically, the list is:

  1. Split up the networking/pkg to separate packages networking#670
  2. Move autoscaler/metric.Stat to its own package so it doesn't pull in any autoscaling APIs
  3. Don't consume corev1.Probe in the queue proxy - but create a copy of it (a sketch follows at the end of this comment)
  4. Move knative.dev/configmap.Parse methods to their own package - no corev1 dep
  5. Drop corev1 usage in knative.dev/pkg/metrics
    1. drop use of corev1.Secret
    2. drop use of corev1.ConfigMap - move ConfigMapWatcher to its own package
  6. Drop corev1 usage in knative.dev/pkg/tracing
    1. drop use of corev1.ConfigMap

I tested changes 1-3 and removed tracing and metrics in order to see what the floor of completing 4-6 would be - that resulted in a queue-proxy binary size of 15MB.

There's probably a gain to be had when we remove its copy of the prometheus libs - i.e. #11126
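
To make item 3 concrete, a dependency-free stand-in for corev1.Probe could look roughly like this (hypothetical package name and field set; the real change would keep whichever fields queue-proxy actually evaluates, and upstream uses intstr.IntOrString for ports where this sketch uses plain ints). Assuming the probe reaches the proxy as serialized JSON, matching JSON tags let the controller keep emitting a corev1.Probe while the proxy decodes it without importing k8s.io/api:

// Package readiness is a hypothetical home for queue-proxy's own probe types.
package readiness

type Probe struct {
	InitialDelaySeconds int32            `json:"initialDelaySeconds,omitempty"`
	TimeoutSeconds      int32            `json:"timeoutSeconds,omitempty"`
	PeriodSeconds       int32            `json:"periodSeconds,omitempty"`
	SuccessThreshold    int32            `json:"successThreshold,omitempty"`
	FailureThreshold    int32            `json:"failureThreshold,omitempty"`
	HTTPGet             *HTTPGetAction   `json:"httpGet,omitempty"`
	TCPSocket           *TCPSocketAction `json:"tcpSocket,omitempty"`
}

type HTTPGetAction struct {
	Path   string `json:"path,omitempty"`
	Port   int32  `json:"port,omitempty"` // corev1 uses intstr.IntOrString here
	Host   string `json:"host,omitempty"`
	Scheme string `json:"scheme,omitempty"`
}

type TCPSocketAction struct {
	Port int32  `json:"port,omitempty"` // corev1 uses intstr.IntOrString here
	Host string `json:"host,omitempty"`
}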

@dprotaso dprotaso modified the milestones: 0.25.x, 0.26.x Aug 26, 2021
@dprotaso dprotaso modified the milestones: 0.26.x, 0.27.x Oct 6, 2021
@dprotaso dprotaso modified the milestones: v1.0.0, v1.1.0 Nov 3, 2021
@dprotaso dprotaso modified the milestones: v1.1.0, v1.2.0 Jan 4, 2022
@dprotaso dprotaso modified the milestones: v1.2.0, v1.3.0 Feb 2, 2022
@dprotaso
Member

/unassign @mattmoor
/unassign @evankanderson

@qiming-007

qiming-007 commented Jan 2, 2024

Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.

I mean Envoy is written in C++ and has survived ok so.. Rust may not be so bad (and it has worked well for linkerd)! I started a repo at http://github.com/julz/roo-proxy a while ago to experiment a bit (without much progress yet, tho!). One thing I'd really like to try is to see how much of QP we could reimplement as a wasm plug-in for Envoy; that way we could reuse all of the great work and optimisations and features in Envoy and (for people who already have mesh) avoid needing a second sidecar at all.

Anyway, totally ack that there's a maintainability trade-off here; it's just that this may be a case where it's warranted (empirically QP is very expensive for what it does), and Rust is pretty ok.

@julz
The topic is a little old, but I'm wondering whether there are any findings/updates on reimplementing QP as a wasm plug-in for Envoy? The idea is promising even now.
