[server] Investigate server event loop lag #7082

Closed
geropl opened this issue Dec 6, 2021 · 22 comments · Fixed by #7269, #7421, #7442 or #7635
Labels: aspect: performance (anything related to performance) · component: server · team: webapp (Issue belongs to the WebApp team)

Comments

geropl (Member) commented on Dec 6, 2021:

We see an accumulating increase in node.js event loop lag.
This is bad, because it adds a static minimum (!!!) latency to every API call.
(Screenshot from Grafana showing the lag increasing over time.)
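
For reference, event loop lag like this can also be observed in-process with Node's built-in perf_hooks histogram. A minimal sketch (not the actual instrumentation behind the Grafana panel; the resolution and reporting interval are arbitrary choices):

```ts
// Minimal sketch: sample event loop delay with perf_hooks and log percentiles.
// Values are reported in nanoseconds; converted to milliseconds for readability.
import { monitorEventLoopDelay } from "perf_hooks";

const delay = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
delay.enable();

setInterval(() => {
  const toMs = (ns: number) => (ns / 1e6).toFixed(1);
  console.log(
    `event loop delay: p50=${toMs(delay.percentile(50))}ms ` +
      `p99=${toMs(delay.percentile(99))}ms max=${toMs(delay.max)}ms`
  );
  delay.reset();
}, 60_000);
```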

geropl added the component: server and aspect: performance labels on Dec 6, 2021
geropl (Member, Author) commented on Dec 6, 2021:

/schedule

roboquat (Contributor) commented on Dec 6, 2021:

@geropl: Cannot schedule issue - issue does not belong to a team. Use /team to specify one.

In response to this:

/schedule

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

geropl added the team: webapp label on Dec 6, 2021
geropl (Member, Author) commented on Dec 6, 2021:

/schedule

roboquat (Contributor) commented on Dec 6, 2021:

@geropl: Issue scheduled in the meta team (WIP: 0)

In response to this:

/schedule

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

geropl (Member, Author) commented on Dec 6, 2021:

/assign

geropl (Member, Author) commented on Dec 8, 2021:

Not fixed yet.

geropl (Member, Author) commented on Dec 13, 2021:

Taking CPU profiles reveals some candidates for improvement, but nothing that explains the accumulating lag yet (a sketch of capturing such a profile at runtime follows after the list):

  • HostContextProviderImpl (38ms): setTimeout(..., 2000) does O(n²) sync work rather often (source). Also, it uses encrypted columns, which is expensive
  • strategy.pass (7.7ms): two calls here, seems to be too expensive
  • guessGitTokenScopes
  • CodeSyncService fetch
  • spawning oci-tool (40ms)
  • prometheus metrics (17ms)
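
For illustration, a minimal sketch of how such a CPU profile can be captured from a running server via Node's built-in inspector module (not necessarily how the numbers above were gathered; the 30 s duration and output file name are arbitrary):

```ts
// Sketch: capture a CPU profile from within the running process using Node's
// built-in inspector, and write it to disk for inspection in Chrome DevTools.
import { Session } from "inspector";
import { writeFile } from "fs/promises";

async function captureCpuProfile(durationMs = 30_000): Promise<void> {
  const session = new Session();
  session.connect();

  // Small promise wrapper around the callback-based session.post API.
  const post = (method: string, params?: object) =>
    new Promise<any>((resolve, reject) =>
      session.post(method, params, (err, result) => (err ? reject(err) : resolve(result)))
    );

  await post("Profiler.enable");
  await post("Profiler.start");
  await new Promise((resolve) => setTimeout(resolve, durationMs));
  const { profile } = await post("Profiler.stop");

  await writeFile("server.cpuprofile", JSON.stringify(profile));
  session.disconnect();
}
```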

AlexTugarev (Member) commented:

@geropl, happy to discuss findings and provide any details on the first two.

Repository owner moved this from In Progress to Done in 🍎 WebApp Team on Jan 17, 2022
JanKoehnlein reopened this on Jan 17, 2022
Repository owner moved this from Done to In Progress in 🍎 WebApp Team on Jan 17, 2022
JanKoehnlein (Contributor) commented:

Let's not close this before the lag has disappeared.

JanKoehnlein (Contributor) commented:

Removed priority label, because the user impact is currently mitigated by on-call. Doesn't mean we're not working on this.

geropl (Member, Author) commented on Feb 11, 2022:

Dropping assignment, as I'm not actively working on this.

geropl removed their assignment on Feb 11, 2022
geropl moved this from In Progress to Scheduled in 🍎 WebApp Team on Feb 17, 2022
geropl (Member, Author) commented on Feb 21, 2022:

Update

Here is what has happened on this during the last weeks:

  • we identified several sources of bad performance/CPU spikes/long sync blocks and merged/shipped fixes [1, 2]
    • some are still outstanding, but can be continued after the blockers have been addressed by platform
  • @geropl: investigated whether setTimeout/setInterval calls might be causing the slowly accumulating event loop lag (see the timer-tracking sketch after this list). tl;dr: no, those are constant over time or linear in the number of API calls, and did not accumulate (measured over the course of one day)
  • increased the # of pods (and nodes; we're running on smallish n1-standard-4 nodes): https://github.com/gitpod-io/ops/pull/1230
  • @AlexTugarev: merged a way to gather CPU profiles at runtime: [server] add cpu profiler #8284
  • @geropl: set timeouts on all outgoing fetch calls in server ([ts] Update network-facing libs like express (et al.), node-fetch, p-… #8312); a sketch of the timeout pattern appears at the end of this comment
  • @geropl and @AlexTugarev: looked at recent (Feb 18) CPU profiles captured by attaching the Chrome debugger to the pod for ~5 min at runtime
    • the Chrome remote debugger does not show anything conclusive and looks rather empty apart from known short spikes, which are partly already addressed above. Those are a problem in themselves, but not the general problem we're looking at. The biggest time consumer seems to be the GC:
      (screenshot)

    • opening the same profile with the new, in-page DevTools visualization shows a bit of a different picture (happened by accident 🙄):
      (screenshot)
      All long-running calls end in I/O-bound methods (writev, send, etc.); in the first visualization these were all shown as blocked by the GC

    • this led us to believe that the problem might not (so much) be with our code, but with the small VMs we're running on. We know from past experience that high I/O results in high CPU usage on VMs, especially small ones 😬

    • might also fit with another observation: node.js self-reports one picture (screenshot), while the external metrics show another (screenshot)

      • maybe we're filling up runtime/kernel buffers and not cleaning them up?
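
As an aside on the setTimeout/setInterval investigation above, here is a minimal sketch (not the actual instrumentation we used) of counting live timer handles with async_hooks, to check whether they accumulate over time:

```ts
// Sketch: count live timer handles over time with async_hooks. "Timeout" is the
// async resource type Node.js uses for setTimeout/setInterval handles.
import { createHook } from "async_hooks";

const liveTimers = new Set<number>();

createHook({
  init(asyncId, type) {
    if (type === "Timeout") liveTimers.add(asyncId);
  },
  destroy(asyncId) {
    liveTimers.delete(asyncId);
  },
}).enable();

// Periodically report the number of live timers; if this grows without bound,
// timers would be a candidate for the accumulating lag.
setInterval(() => {
  console.log(`live Timeout handles: ${liveTimers.size}`);
}, 60_000).unref();
```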

Suggestions:

  • (merge and deploy the above changes ☝️)
  • bigger nodes: maybe 2 x n2-standard-16 (instead of 6 x n1-standard-4) with bigger caches
  • increase the node.js threadpool size via UV_THREADPOOL_SIZE (libuv's threadpool handles blocking work such as file I/O and DNS lookups) from the default of 4 to something higher, depending on the actual number of cores per pod
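
Regarding the fetch timeouts mentioned above (#8312), the general pattern looks roughly like the sketch below, using an AbortController to cut off hanging upstream calls. The wrapper name, the 10 s default, and the use of the global fetch (Node 18+) are assumptions for illustration, not necessarily what the PR does:

```ts
// Sketch: wrap outgoing HTTP calls so they abort after a fixed timeout instead
// of hanging forever; the same pattern works with node-fetch via its signal option.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 10_000
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

// Usage example (hypothetical URL):
// const res = await fetchWithTimeout("https://api.example.com/health", {}, 5_000);
```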

ghuntley (Contributor) commented on Mar 8, 2022:

Received a page moments ago about event loop lag (internal link), which self-healed.

(screenshot: 2022-03-08_15-43-54)

geropl moved this from Scheduled to Needs Design in 🍎 WebApp Team on Mar 11, 2022
geropl removed the status in 🍎 WebApp Team on Mar 11, 2022
geropl (Member, Author) commented on Apr 14, 2022:

Received a page moments ago about event loop lag (internal link), which self-healed.

This is more of an alerting/PagerDuty artifact.

geropl (Member, Author) commented on Apr 14, 2022:

Closing this issue, because it describes old behavior and is based on old code. It does not make sense to continue here, especially as we can always go back and read up on the history/prior investigation.
