🐛 Bug Report: Out of Memory Error when Collating TechDocs for Search #23047
Comments
Interesting. The last fix we made for OOM recently was due to insufficiently bounded parallelism. Looking at the code of the collator, it is slightly hard to follow, but it does try to leverage a limiter. Might need to dig deeper into that code. backstage/plugins/search-backend-module-techdocs/src/collators/DefaultTechDocsCollatorFactory.ts Line 131 in 25adbdd
|
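As a point of reference, the bounded-parallelism pattern mentioned above looks roughly like the sketch below, assuming a p-limit style limiter; the concurrency value and the fetchSearchIndex helper are illustrative placeholders, not the collator's actual code.
import pLimit from 'p-limit';

// Hypothetical helper standing in for whatever downloads a per-entity
// search_index.json; it is not part of the real collator API.
declare function fetchSearchIndex(entityRef: string): Promise<unknown>;

// Cap how many index fetches run at once so memory and open connections stay bounded.
const limit = pLimit(10); // illustrative concurrency cap

export async function fetchAllIndexes(entityRefs: string[]): Promise<unknown[]> {
  return Promise.all(entityRefs.map(ref => limit(() => fetchSearchIndex(ref))));
}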
Updated the description to include that we are using SQLite for the database, which is in-memory, and also using Lunr for the search indexing, which is also in-memory. I'm wondering if perhaps these aren't helping either? |
Yeah, that's an interesting point - I was hoping to look a bit closer at this today but didn't have the time to - if there's any way that you can boil this down to really being ONLY about that collator or so, that'd of course be stellar |
I think the issue is in the Lunr indexer, which keeps all objects in memory until the collator finishes emitting items.
|
@vinzsvi but for a |
This does seem to be an issue with lunr. I made a small sample repo here: https://github.com/wss-dogara/techdocs-oom-bug It takes a search_index.json generated from the backstage repo docs folder (~20MB in size) and adds it to a builder the same way that |
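To illustrate why that retains memory, here is a minimal sketch of building a Lunr index the way the sample repo describes; the loadDocuments helper and the document shape (location/title/text fields) are assumptions for the example, not the sample repo's code.
import lunr from 'lunr';

// Hypothetical loader for a large search_index.json; not part of the sample repo.
declare function loadDocuments(): Array<{ location: string; title: string; text: string }>;

const builder = new lunr.Builder();
builder.ref('location');
builder.field('title');
builder.field('text');

// Every added document is retained by the builder until build() runs,
// so a ~20MB index file becomes a large in-memory working set.
for (const doc of loadDocuments()) {
  builder.add(doc);
}

const index = builder.build();
console.log(index.search('backstage').length, 'hits for "backstage"');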
Hi, I think we are facing the same issue on our production and pp environments (without any user activity)
Our monitoring discovered OOM after 2-3 days even without any user activity on the portal. We can see the memory increasing until it reaches the limit. We can see on the logs before the OOM
Thanks |
The recommendation is to switch to another search engine in production, as mentioned here: https://backstage.io/docs/features/search/search-engines#lunr
|
Unless it's a configuration issue, we are using the Postgres search engine:
const searchEngine = await PgSearchEngine.fromConfig(env.config, {
  database: env.database,
});
const indexBuilder = new IndexBuilder({
logger: env.logger,
searchEngine,
});
indexBuilder.addCollator({
schedule,
factory: DefaultCatalogCollatorFactory.fromConfig(env.config, {
discovery: env.discovery,
tokenManager: env.tokenManager,
}),
});
indexBuilder.addCollator({
schedule,
factory: DefaultTechDocsCollatorFactory.fromConfig(env.config, {
discovery: env.discovery,
logger: env.logger,
tokenManager: env.tokenManager,
}),
});
indexBuilder.addCollator({
schedule: schedule,
factory: AnnouncementCollatorFactory.fromConfig({
logger: env.logger,
discoveryApi: env.discovery,
}),
}); |
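For context, the schedule passed to each addCollator call above comes from the plugin task scheduler in the old backend system; a minimal sketch follows, assuming the standard plugin environment (`env` from the snippet above) and using illustrative timings rather than the poster's actual settings.
// Hypothetical schedule for the collators above; the durations are examples only.
const schedule = env.scheduler.createScheduledTaskRunner({
  frequency: { minutes: 10 },
  timeout: { minutes: 15 },
  initialDelay: { seconds: 3 },
});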
Interesting, thank you for the debugging info 🙏. Definitely something to look into. |
Hi @jonesbusy, your logs say When I was having issues with some bad content causing our indexing to fail completely, I added an Also, just a more general comment: search runs on a schedule to keep its indexes up to date, so it's not necessarily surprising that it could use memory even when users are not actively using it. |
I can see both tasks running more or less at the same time
probably due to the
The OOM always follows the execution of those tasks |
Thanks! What's the size of your catalog and how many entities have TechDocs? Can you try running with debug on for the logs? It will print out the entities in each batch that way. I might pair that with a smaller default batch size as well. Would help tell if this is related to a large |
I will answer with more details early next week, but in our preproduction environment we have perhaps 50 TechDocs entities. For the catalog it's similar, but if we consider users and groups it's more than 2,000 (not sure if relevant for the search indexer). That's why it confuses me, because our catalog and TechDocs are very small for the moment. Thanks for the help and tips to debug this issue. |
Thanks for the follow up. Yeah, it's all very odd to me. My experience with PG and search was very solid: memory usage was pretty minimal, never had OOM. The Catalog had about 750+ entities with about 80% having TechDocs. We had about 500+ Users and Groups. |
From what I understood this has also been disabled on the demo Backstage: https://github.com/backstage/demo/blob/b4d9b44c8cfb8ee5b517a23b91d0de7db12a1b3f/packages/backend/src/index.ts#L31 On our side we still have high memory usage and trigger OOM after some point. We are now on Backstage 1.27.5 with all plugins migrated to the new backend system. What I'm not really sure about is the engine used for each, Lunr or Postgres (not sure where to look in the config) |
@jonesbusy, yes, this issue was logged specifically for the issue we ran into on the Demo site. The Demo site is an outlier though, as it doesn't follow the recommended patterns for a "production" deployment - it runs on SQLite and uses Lunr for search. It's also resource constrained, so adding more RAM would also solve the problem. The Demo site was not seeing the same increasing memory but a one-time OOM error that would crash it entirely in a loop. |
@jonesbusy, can you please log a new issue for your OOM error? I think that makes more sense as, again, this issue is pretty unique to the Demo site. |
Interesting, I didn't realize this was due to the new BE and the TechDocs search. It started when I upgraded Backstage, switched 90% of the plugins to the new BE, and updated to the latest Docker image in the doc. In our case, we solved it with the old way of setting node memory. Our prod/staging uses the Postgres DB, and we have like 20 TechDocs and 900 locations (which I guess translates to 900 entries in the catalog, between AWS resources, users, components & APIs). I CC @LegendSebastianoL, that's my work account. |
Hi @sebalaini (@LegendSebastianoL), this issue is really specific to the unusual AND not recommended setup we have for the Demo site. If you are using both SQLite as your database and the Lunr search engine, this is not the recommended setup for any non-local environment. I recommend using Postgres as your database and at minimum Postgres as your search engine; Elasticsearch would be a more advanced option. If you have already done that and are still having OOM issues, then I would log an issue for that, as it's likely not the same issue identified here. |
Our setup uses PG both locally and in prod, but I think we aren't using the PG search engine anymore since we switched to the new BE. This is our current BE:
I remember this config in the old BE but can't see anything in the new one. Is there some doc about how to set up the search in the new BE with PG? |
OK, I think this should add the PG search engine, right? After adding the plugin I added this config:
and the highlight disappeared, so I guess now we are using the Postgres search engine |
As the |
Thanks @LegendSebastianoL, it's exactly what I needed; I was missing the import and config (I didn't find this documentation anywhere). @camilaibs I confirm now that the search is done on Postgres. I will open another issue if I still experience OOM in the next few days. |
Hey @jonesbusy, the doc TBH is missing. I just tried to use the new standard way to import plugins in the new BE and checked that the plugin upstream had the new BE config :) The whole search doc is for the old BE. |
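Since the doc for this is thin, here is a minimal sketch of what the new backend wiring discussed above roughly looks like; the package names are the upstream search module names, but treat the exact setup as an assumption rather than official documentation.
// packages/backend/src/index.ts (new backend system) - illustrative sketch
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();

// Search plugin plus the collators being indexed
backend.add(import('@backstage/plugin-search-backend'));
backend.add(import('@backstage/plugin-search-backend-module-catalog'));
backend.add(import('@backstage/plugin-search-backend-module-techdocs'));

// Postgres search engine module - replaces the default in-memory Lunr engine
backend.add(import('@backstage/plugin-search-backend-module-pg'));

backend.start();
The Postgres engine also reads its options from app-config under search.pg (highlight settings live there), which would line up with the highlight change mentioned above.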
📜 Description
Currently when we enable TechDocs search on the Backstage Demo site we run into an Out of Memory error which crashes the container. This happens shortly after start up and causes a crash loop.
Some important details:
ENV NODE_OPTIONS "--max-old-space-size=1000"
(locally this seems to break between 1250 and 1500)
👍 Expected behavior
We should not hit an out of memory error when collating TechDocs for Search.
👎 Actual Behavior with Screenshots
First we see "Collating documents for techdocs via DefaultTechDocsCollatorFactory" in the logs, then it tries to get the search_index.json for one of the entities, in this case for: https://demo.backstage.io/catalog/default/component/backstage. There is a short delay and then we get this error and the app crashes:
👟 Reproduction steps
This is pretty tricky, I have a branch that can be used as a starting point here: https://github.com/awanlin/backstage/tree/debug/techdocs-search
To use this branch to test with, do the following:
1. yarn install, yarn tsc and yarn build:backend --config ../../app-config.yaml --config ../../app-config.local.yaml
2. docker image build . -t backstage --progress=plain
3. docker run -it -p 7007:7007 backstage
4. Notice: after a minute or so it will crash with the out of memory error previously shared
📃 Provide the context for the Bug.
As this causes a crash loop, we aren't able to enable TechDocs search on the Demo site. This would be nice to showcase there for those looking at adopting Backstage!
🖥️ Your Environment
I think the critical environment-specific details are: the size of the index file, using a GCS bucket, the new backend system, and the limited amount of RAM.
👀 Have you spent some time to check if this bug has been raised before?
🏢 Have you read the Code of Conduct?
Are you willing to submit PR?
No, but I'm happy to collaborate on a PR with someone else