🐛 Bug Report: Techdocs Memory Leak #27347

Open
2 tasks done
pcgqueiroz opened this issue Oct 25, 2024 · 18 comments
Labels
area:techdocs (Related to the TechDocs Project Area), bug (Something isn't working)

Comments

@pcgqueiroz

📜 Description

The techdocs-backend plugin is experiencing a memory leak when integrated with the search-backend-module-techdocs. In a production setup with a dedicated pod for techdocs-backend, the /static/docs endpoint is queried by the search-backend-module-techdocs every 10 minutes. With over 3000 entities to process, memory consumption increases progressively until the JavaScript heap is exhausted, resulting in a crash.

👍 Expected behavior

The techdocs-backend plugin should handle repeated /static/docs queries and large volumes of entities without excessive memory consumption or memory leaks, allowing stable operation in production.

👎 Actual Behavior with Screenshots

The memory usage of the techdocs-backend plugin gradually increases each time it processes requests from the search-backend-module-techdocs, eventually causing the JavaScript heap to overflow and crash.

[Screenshot: memory usage of the techdocs-backend pod climbing steadily until the crash]

<--- Last few GCs --->
[1:0x65c76b0] 20854057 ms: Scavenge 2021.6 (2074.2) -> 2019.2 (2076.2) MB, 11.9 / 0.0 ms  (average mu = 0.941, current mu = 0.519) allocation failure;
[1:0x65c76b0] 20854079 ms: Scavenge 2023.1 (2076.2) -> 2021.2 (2079.2) MB, 14.9 / 0.0 ms  (average mu = 0.941, current mu = 0.519) allocation failure;
[1:0x65c76b0] 20855608 ms: Scavenge 2026.6 (2079.2) -> 2024.2 (2097.2) MB, 1527.2 / 0.0 ms  (average mu = 0.941, current mu = 0.519) allocation failure;
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb87bc0 node::Abort() [node]
 2: 0xa96834  [node]
 3: 0xd687f0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xd68b97 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xf462a5  [node]
 6: 0xf5878d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0xf32e8e v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 8: 0xf34257 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 9: 0xf1542a v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
10: 0x12da78f v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
11: 0x170a079  [node]

👟 Reproduction steps

  1. Deploy techdocs-backend in a microservices setup with a dedicated pod.
  2. Configure it with a local builder/generator and AWS S3-type local storage (see the config sketch after these steps).
  3. Set up a regular polling interval (e.g., 10 minutes) for the search-backend-module-techdocs on /static/docs.
  4. Observe memory usage as the techdocs-backend processes a large set of entities (over 3000, with some missing documentation in S3).
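
For reference, a minimal app-config.yaml sketch of the kind of setup described above; the bucket name, endpoint, and credential variables are placeholders rather than the reporter's actual values:

techdocs:
  builder: 'local'
  generator:
    runIn: 'local'
  publisher:
    type: 'awsS3'
    awsS3:
      bucketName: 'techdocs'                      # placeholder
      endpoint: 'https://s3.on-prem.example.com'  # on-prem, S3-compatible endpoint (placeholder)
      s3ForcePathStyle: true
      credentials:
        accessKeyId: ${TECHDOCS_S3_ACCESS_KEY}
        secretAccessKey: ${TECHDOCS_S3_SECRET_KEY}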

📃 Provide the context for the Bug.

No response

🖥️ Your Environment

Backstage version: 1.32.3
Node.js version: v18.18.0
Setup: Microservices with a dedicated pod for techdocs-backend
TechDocs Configuration: Local builder and generator, publishing files to a local AWS S3-type bucket

👀 Have you spent some time to check if this bug has been raised before?

  • I checked and didn't find a similar issue

🏢 Have you read the Code of Conduct?

Are you willing to submit a PR?

No, but I'm happy to collaborate on a PR with someone else

pcgqueiroz added the bug label on Oct 25, 2024
github-actions bot added the area:techdocs label on Oct 25, 2024
@awanlin (Collaborator) commented Oct 25, 2024

Hi @pcgqueiroz, can you share your full Backstage details by running yarn backstage-cli info and posting the results? I suspect you may be using the default Lunr search engine, which isn't ideal in production as it's entirely in-memory. We've had enough issues with it and the TechDocs Search Collator on the Demo site that we actually had to disable it; details are here: #23047.

As of a few releases ago, new Backstage instances ship with the Postgres Search Engine to help avoid this, since we also recommend Postgres as your production database. I would highly suggest using it. 👍
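
For anyone else reading along, switching to the Postgres engine with the new backend system looks roughly like the following (a sketch of the documented setup; check the plugin READMEs for the exact import paths in your release):

// packages/backend/src/index.ts
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();
backend.add(import('@backstage/plugin-search-backend'));
// Replaces the default in-memory (Lunr) engine with the Postgres search engine
backend.add(import('@backstage/plugin-search-backend-module-pg'));
backend.start();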

@pcgqueiroz (Author)

Hi @awanlin, I am already using the Postgres Search Engine (@backstage/plugin-search-backend-module-pg) in production.

Also, the search backend and its modules run in a different pod (which is not leaking memory), since my production Backstage runs in a microservices architecture.

The pod that is leaking memory has only the techdocs-backend plugin and a dedicated catalog-backend plugin running.

Here is my configuration:

OS: Linux 6.8.0-47-generic - linux/x64
node: v18.18.0
yarn: 4.5.0
cli: 0.28.1 (installed)
backstage: 1.32.3

Dependencies:
@backstage/app-defaults 1.5.12
@backstage/backend-app-api 0.7.9, 0.9.3, 1.0.1
@backstage/backend-common 0.21.7, 0.23.3, 0.24.1, 0.25.0
@backstage/backend-defaults 0.4.4, 0.5.2
@backstage/backend-dev-utils 0.1.5
@backstage/backend-openapi-utils 0.2.0
@backstage/backend-plugin-api 0.6.21, 0.7.0, 0.8.1, 1.0.1
@backstage/backend-tasks 0.5.27, 0.6.1
@backstage/backend-test-utils 1.0.2
@backstage/catalog-client 1.7.1
@backstage/catalog-model 1.7.0
@backstage/cli-common 0.1.14
@backstage/cli-node 0.2.9
@backstage/cli 0.28.1
@backstage/config-loader 1.9.1
@backstage/config 1.2.0
@backstage/core-app-api 1.15.1
@backstage/core-compat-api 0.3.1
@backstage/core-components 0.14.10, 0.15.1
@backstage/core-plugin-api 1.10.0
@backstage/dev-utils 1.1.2
@backstage/e2e-test-utils 0.1.1
@backstage/errors 1.2.4
@backstage/eslint-plugin 0.1.10
@backstage/frontend-app-api 0.10.0
@backstage/frontend-defaults 0.1.1
@backstage/frontend-plugin-api 0.9.0
@backstage/frontend-test-utils 0.2.1
@backstage/integration-aws-node 0.1.12
@backstage/integration-react 1.2.0
@backstage/integration 1.15.1
@backstage/plugin-api-docs 0.11.11
@backstage/plugin-app 0.1.1
@backstage/plugin-auth-backend-module-atlassian-provider 0.3.1
@backstage/plugin-auth-backend-module-auth0-provider 0.1.1
@backstage/plugin-auth-backend-module-aws-alb-provider 0.2.1
@backstage/plugin-auth-backend-module-azure-easyauth-provider 0.2.1
@backstage/plugin-auth-backend-module-bitbucket-provider 0.2.1
@backstage/plugin-auth-backend-module-bitbucket-server-provider 0.1.1
@backstage/plugin-auth-backend-module-cloudflare-access-provider 0.3.1
@backstage/plugin-auth-backend-module-gcp-iap-provider 0.3.1
@backstage/plugin-auth-backend-module-github-provider 0.2.1
@backstage/plugin-auth-backend-module-gitlab-provider 0.2.1
@backstage/plugin-auth-backend-module-google-provider 0.2.1
@backstage/plugin-auth-backend-module-guest-provider 0.2.1
@backstage/plugin-auth-backend-module-microsoft-provider 0.2.1
@backstage/plugin-auth-backend-module-oauth2-provider 0.3.1
@backstage/plugin-auth-backend-module-oauth2-proxy-provider 0.2.1
@backstage/plugin-auth-backend-module-oidc-provider 0.3.1
@backstage/plugin-auth-backend-module-okta-provider 0.1.1
@backstage/plugin-auth-backend-module-onelogin-provider 0.2.1
@backstage/plugin-auth-backend 0.23.1
@backstage/plugin-auth-node 0.4.17, 0.5.3
@backstage/plugin-auth-react 0.1.7
@backstage/plugin-bitbucket-cloud-common 0.2.24
@backstage/plugin-catalog-backend-module-github 0.7.6
@backstage/plugin-catalog-backend-module-ldap 0.9.1
@backstage/plugin-catalog-backend-module-logs 0.1.3
@backstage/plugin-catalog-backend-module-openapi 0.2.3
@backstage/plugin-catalog-backend-module-scaffolder-entity-model 0.2.1
@backstage/plugin-catalog-backend-module-unprocessed 0.5.1
@backstage/plugin-catalog-backend 1.27.1
@backstage/plugin-catalog-common 1.1.0
@backstage/plugin-catalog-graph 0.4.11
@backstage/plugin-catalog-import 0.12.5
@backstage/plugin-catalog-node 1.13.1
@backstage/plugin-catalog-react 1.14.0
@backstage/plugin-catalog-unprocessed-entities-common 0.0.4
@backstage/plugin-catalog-unprocessed-entities 0.2.9
@backstage/plugin-catalog 1.24.0
@backstage/plugin-devtools-backend 0.4.1
@backstage/plugin-devtools-common 0.1.12
@backstage/plugin-devtools 0.1.19
@backstage/plugin-events-node 0.3.10, 0.4.3
@backstage/plugin-home-react 0.1.18
@backstage/plugin-home 0.8.0
@backstage/plugin-org 0.6.31
@backstage/plugin-permission-backend 0.5.50
@backstage/plugin-permission-common 0.7.14, 0.8.1
@backstage/plugin-permission-node 0.7.32, 0.8.4
@backstage/plugin-permission-react 0.4.27
@backstage/plugin-scaffolder-backend-module-azure 0.2.1
@backstage/plugin-scaffolder-backend-module-bitbucket-cloud 0.2.1
@backstage/plugin-scaffolder-backend-module-bitbucket-server 0.2.1
@backstage/plugin-scaffolder-backend-module-bitbucket 0.3.1
@backstage/plugin-scaffolder-backend-module-confluence-to-markdown 0.3.1
@backstage/plugin-scaffolder-backend-module-gerrit 0.2.1
@backstage/plugin-scaffolder-backend-module-gitea 0.2.1
@backstage/plugin-scaffolder-backend-module-github 0.5.1
@backstage/plugin-scaffolder-backend-module-gitlab 0.6.0
@backstage/plugin-scaffolder-backend 1.26.2
@backstage/plugin-scaffolder-common 1.5.6
@backstage/plugin-scaffolder-node 0.4.11, 0.5.0
@backstage/plugin-scaffolder-react 1.13.2
@backstage/plugin-scaffolder 1.26.2
@backstage/plugin-search-backend-module-catalog 0.2.4
@backstage/plugin-search-backend-module-explore 0.2.4
@backstage/plugin-search-backend-module-pg 0.5.37
@backstage/plugin-search-backend-module-techdocs 0.3.1
@backstage/plugin-search-backend-node 1.3.4
@backstage/plugin-search-backend 1.6.1
@backstage/plugin-search-common 1.2.14
@backstage/plugin-search-react 1.8.1
@backstage/plugin-search 1.4.18
@backstage/plugin-signals-node 0.1.13
@backstage/plugin-signals-react 0.0.6
@backstage/plugin-techdocs-backend 1.11.1
@backstage/plugin-techdocs-common 0.1.0
@backstage/plugin-techdocs-module-addons-contrib 1.1.16
@backstage/plugin-techdocs-node 1.12.12
@backstage/plugin-techdocs-react 1.2.9
@backstage/plugin-techdocs 1.11.0
@backstage/plugin-user-settings-backend 0.2.26
@backstage/plugin-user-settings-common 0.0.1
@backstage/plugin-user-settings 0.8.14
@backstage/release-manifests 0.0.11
@backstage/test-utils 1.7.0
@backstage/theme 0.5.7, 0.6.0
@backstage/types 1.1.1
@backstage/version-bridge 1.0.10

@awanlin (Collaborator) commented Oct 25, 2024

Awesome, that at least eliminates that as a potential problem 👍

@iamEAP (Member) commented Nov 12, 2024

Hey there! I was investigating something similar and happened upon this thread.

@pcgqueiroz, I believe the problem you're experiencing may be specific to the AWS publisher implementation, related to this line in the TechDocs awsS3 publisher.

Other implementations of the docsRouter pipe the stream returned from the object store service (e.g. here in the GCS implementation), but the AWS implementation seems to load the entire file contents into memory before sending it on to clients.

I unfortunately don't have the bandwidth (nor a nice/easy test environment) to fix this, but my assumption is that removing the streamToBuffer() call and instead piping the stream into the response will reduce memory usage substantially.
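
For anyone wanting to try this, here is a rough sketch of the piping approach. The route path, bucket name, and handler shape are illustrative assumptions, not the actual publisher code, and error handling for missing objects is omitted:

import express from 'express';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Hypothetical stand-in for the awsS3 publisher's docsRouter handler.
const s3 = new S3Client({});
const router = express.Router();

router.get('/static/docs/*', async (req, res) => {
  const resp = await s3.send(
    new GetObjectCommand({ Bucket: 'techdocs', Key: req.params[0] }),
  );
  // Instead of buffering the whole object with streamToBuffer() and calling
  // res.send(), pipe the S3 body straight into the HTTP response so only a
  // small chunk is held in memory at any time.
  await pipeline(resp.Body as Readable, res);
});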

@pcgqueiroz (Author) commented Nov 20, 2024 via email

@pcgqueiroz (Author) commented Nov 22, 2024

Hey @iamEAP,

I tried to make the change you suggested:

- res.send(await streamToBuffer(resp.Body as Readable));
+ const fileStream = resp.Body as Readable;
+ fileStream.pipe(res);

The code worked but it did not fix the memory leak.

Any other ideas that I could test? Thanks

@Ferin79 commented Dec 4, 2024

Memory Profiling Snapshot (1 Week)

The following memory profiling snapshot from DataDog indicates a potential memory leak:

[Screenshot: DataDog memory profiling over one week]

It appears that keyv is leaking memory. This issue was addressed in version 4.3.0 as noted in this GitHub issue.

Currently, Backstage is using an older version (4.0.0) of keyv, hoisted from the following dependency chain:
_project_#backend#@backstage#plugin-scaffolder-backend#@backstage#plugin-scaffolder-backend-module-gitlab#@gitbeaker#node#got#cacheable-request#keyv.

@squid-ney (Contributor)

Hi @pcgqueiroz and @Ferin79!
I'm digging into this issue more since the suggested change to pipe the stream did not work. @pcgqueiroz could you share how you confirmed the piping did not work?

@Ferin79 Thanks for pointing out the need to upgrade keyv! I'm hesitant to say that the keyv library is causing this specific issue, though. It looks like keyv is only used in TechDocs for the TechDocs cache, and @pcgqueiroz's TechDocs setup is not using the cache. We should still update this dependency, though.

@squid-ney (Contributor)

I think this could be related to the S3Client configuration. I was reading through aws/aws-sdk-js-v3#3560 and noticed some discussion of the socket settings in SDK v3. Experimenting with those settings might help memory usage, by capping socket growth and ensuring there is a timeout so idle sockets can be cleaned up.
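
For example, something along these lines when constructing the S3Client. This is only a sketch: in TechDocs the client is built inside the awsS3 publisher, so wiring this in would need a change there, the numbers are just starting points to experiment with, and older SDK versions import the handler from @aws-sdk/node-http-handler instead:

import { S3Client } from '@aws-sdk/client-s3';
import { NodeHttpHandler } from '@smithy/node-http-handler';
import { Agent } from 'https';

const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    httpsAgent: new Agent({
      keepAlive: true,
      maxSockets: 50, // cap concurrent sockets instead of letting them grow unbounded
    }),
    connectionTimeout: 5000, // ms allowed to establish a connection
    socketTimeout: 30000,    // ms of socket inactivity before the request is aborted
  }),
});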

@pcgqueiroz (Author)

Hi @squid-ney, thanks for looking at this issue. To answer your question: when I replaced the buffer with the pipe, the code did work, but it didn't solve the memory leak problem. Sorry that I was not clear enough in my comment above.

In my tests, I have the same feeling that the issue is related to the S3Client. Let me know if I can help in any way.

@iamEAP (Member) commented Dec 6, 2024

Do you happen to have the permission framework enabled?

There's one more code path in TechDocs that will cause keyv to be used in that case (in order to cache permission responses for a given user and entity).

@pcgqueiroz (Author)

Hey @iamEAP, I do not know exactly what you mean by having the permission framework enabled, but I do have @backstage/plugin-permission-backend enabled.

The techdocs.cache option is not enabled/configured.

@iamEAP (Member) commented Dec 6, 2024

Yes, this wouldn't respect techdocs.cache. Basically, when permissions.enabled is set to true in app-config.yaml, this middleware is applied to ensure that anyone loading docs content for a given entity has permission to view that entity. It does that check by attempting to load the entity before serving any content associated with it.
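
For reference, "permission framework enabled" refers to this app-config.yaml setting:

permission:
  enabled: true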

This check could happen frequently (because it would be called not just for the HTML content of a page, but also all images/assets loaded on that page), so the result gets cached for a brief moment.

All of that to say: don't rule the keyv issue out entirely. Hopefully it's straightforward to bump that dep in your yarn.lock. Something quick and easy to rule out.
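
If it helps, one quick way to rule it out is a Yarn resolution in the root package.json forcing the patched version (per the keyv issue linked above):

{
  "resolutions": {
    "keyv": "^4.3.0"
  }
}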

@pcgqueiroz (Author) commented Dec 6, 2024

As far as I could trace in my yarn.lock file, the only keyv version with the memory leak is 4.0.0, and it is only pulled in by @backstage/plugin-scaffolder-backend-module-gitlab, which I am not using.

Unfortunately I do not think this is related :-(

@Ferin79 commented Dec 6, 2024

@pcgqueiroz The package @backstage/plugin-scaffolder-backend-module-gitlab is installed as a dependency of @backstage/plugin-scaffolder-backend, so even if you are not using it, it still ends up in your installation.

Also, the got library has a leak when using its cache: sindresorhus/got#1128

@squid-ney (Contributor)

Hey @pcgqueiroz, a few questions about your TechDocs setup:

  • In the issue description, you mention using an "AWS S3-type local storage", could you elaborate on this more?
  • Are you setting a custom parallelismLimit for the techdocs collator in search-backend-module-techdocs?

@pcgqueiroz (Author)

Hi @Ferin79, my environment is set up as microservices, and I have one dedicated pod running the techdocs-backend. The scaffolder-backend does not run in the same pod, and that pod shows no memory leaks.

@pcgqueiroz (Author)

Hi @squid-ney, answering your first question: "AWS S3-type local storage" means that I am using awsS3 as the techdocs.publisher.type configuration (with an on-prem setup for bucket storage).

I am not setting a custom parallelismLimit for the techdocs collator. Should I?
