Store: consumes lots of memory at startup and loop restart cause OOM #6643

Closed

chalut01 opened this issue Aug 24, 2023 · 30 comments
@chalut01

chalut01 commented Aug 24, 2023

Thanos Version v0.32.0
Everything is set up the same way; only the version is different. Is anyone else seeing this?

[Screenshot attached: 2023-08-24 13:33:49]

@rgarcia89
Contributor

I have experienced exactly the same issue.

I closed my ticket (#6644) as this one covers the same issue.

@pahaeanx

Same here. Can't get it to start up with 6GB of RAM when it used to run with ~3GB.

@yeya24
Contributor

yeya24 commented Aug 24, 2023

The only change I can think of that might affect the startup memory usage of the store gateway is #6509, but ideally that change should improve memory usage.

Do you have the lazy index header enabled? Can you share your store gateway config?

@pahaeanx

The only change I can think of that might affect the startup memory usage of the store gateway is #6509, but ideally that change should improve memory usage.

Do you have the lazy index header enabled? Can you share your store gateway config?

No, I use a pretty vanilla config I'd say. Failing config (formatted for readability):

thanos store --max-time=-1w --grpc-address=localhost:15000 --http-address=localhost:15001 \
  --data-dir=/var/lib/thanos-cache/store --objstore.config-file=/etc/thanos/s3.yml \
  --grpc-server-tls-cert=/etc/thanos/thanos.cer --grpc-server-tls-key=/etc/thanos/thanos.key \
  --sync-block-duration=30m

This is currently running with 0.30.2 and fails with 0.32.0.
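
Since lazy index-header loading came up above and the relevant flag names differ between Thanos versions, one safe way to check which index-header options a given build actually exposes (rather than assuming a specific flag) is to grep the help output:

# List index-header related flags supported by this thanos build.
# Flag names vary across versions, so inspect --help instead of guessing.
thanos store --help 2>&1 | grep -i "index-header"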

@rgarcia89
Contributor

rgarcia89 commented Aug 24, 2023

@yeya24 I deploy the thanos store using kube-thanos. It starts with the following args in the manifest. Currently on v0.31.0

        - store
        - --log.level=info
        - --log.format=logfmt
        - --data-dir=/var/thanos/store
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --ignore-deletion-marks-delay=24h

@antikilahdjs

I have the same issue and upgraded to v0.31. All the other components work perfectly on the 0.32 version, but the store consumes 1 TB.

@GiedriusS
Member

Maybe it would be possible for you to take a pprof memory profile during bootup of Thanos Store and share it here? Thanos Store exposes a pprof endpoint on /debug/pprof. 🤔
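
For the bare-VM setup shared earlier (the config above binds the HTTP endpoint to localhost:15001), capturing a heap profile during startup is just a curl against that address; a minimal sketch, assuming the process stays up long enough to answer:

# Capture a heap profile from the store gateway's HTTP endpoint
# (localhost:15001 is the --http-address from the config shared above).
curl http://localhost:15001/debug/pprof/heap > heap.pprof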

@bboysoulcn

Same here.

@chalut01
Author

chalut01 commented Aug 26, 2023

@GiedriusS I'm sorry, but I can't capture a pprof profile on the production environment.
In my development environment the data set is small and everything works well with v0.32.0.

Can someone share a pprof?

@rgarcia89
Contributor

Maybe it would be possible for you to take a pprof memory profile during bootup of Thanos Store and share it here? Thanos Store exposes a pprof endpoint on /debug/pprof. 🤔

@GiedriusS I have Thanos running as a container in Kubernetes using kube-thanos. How can I reach this endpoint?

@yeya24
Contributor

yeya24 commented Aug 28, 2023

@rgarcia89 Can you port-forward one of your store gateway pods using its HTTP port? For example, say it is forwarded to local port 8080:

curl http://localhost:8080/debug/pprof/heap > heap.pprof

You can get the heap profile by running the command above. Make sure to do it right when the store gateway starts...
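
For the kube-thanos deployment mentioned above, a minimal sketch of that port-forward (the namespace and pod name are placeholders; the container HTTP port 10902 is taken from the manifest args shared earlier):

# Forward local port 8080 to the store gateway pod's HTTP port 10902.
kubectl -n <namespace> port-forward pod/<store-gateway-pod> 8080:10902

# Then, shortly after the pod starts, grab the heap profile.
curl http://localhost:8080/debug/pprof/heap > heap.pprof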

@rgarcia89
Contributor

@yeya24 here you go, including the heap of a running 0.31.0 Thanos Store vs. the heap of a just-started 0.32.0 Thanos Store, which then crashed

heap-pprof.zip
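
For anyone comparing profiles like these locally, the standard Go pprof tooling can read them; a quick sketch, assuming a local Go toolchain (the file names below are placeholders for the two heaps in the archive):

# Show the top allocations in a single heap profile.
go tool pprof -top heap-0.32.0.pprof

# Or diff the new heap against the old one to see what grew.
go tool pprof -top -diff_base=heap-0.31.0.pprof heap-0.32.0.pprof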

@MichaHoffmann
Contributor

Oh, that's entirely the cuckoo filter.

@MichaHoffmann
Contributor

Do you have some figures around your cardinality?

@rgarcia89
Contributor

@MichaHoffmann anything specific you are looking for? Thanos is currently managing 1,792,044 time series.

@MichaHoffmann
Contributor

This filter should (theoretically) scale with the number of label names you have, but to reach 128 GB it would need something like hundreds of millions of them (again, theoretically; maybe there is a bug somewhere).

@rgarcia89
Contributor

rgarcia89 commented Aug 28, 2023

Good question. Everything runs smoothly with v0.31.0, and I haven't seen any alerts or issues regarding high cardinality.

@MichaHoffmann
Contributor

Good question. Everything runs smoothly with v0.31.0, and I haven't seen any alerts or issues regarding high cardinality.

That cuckoo filter was introduced in 0.32.0, so that makes sense.

@rgarcia89
Contributor

Maybe something is being counted wrong in that filter. Otherwise I'm not sure how 128 GB can be justified.

@saswatamcode
Member

#6669 should address this

@GiedriusS
Member

How does it look with v0.32.1?

@rgarcia89
Contributor

Very good. The issue is gone on my clusters.

@pahaeanx

Unfortunately it's still crashing with 0.32.1. I can't do any more digging today, so all I can offer for now is the crash log, but it seems like it OOM-crashed again. I can try scaling the VM up tomorrow so I can maybe get it to start.

crash.log.tar.gz

@farodin91
Contributor

It looks like the store uses around 50% more memory than before, but there's no runaway growth.

@MichaHoffmann
Contributor

That's not an OOM crash!

Aug 28 15:38:40 440019-prod-observer01 thanos[75687]: fatal error: ts=2023-08-28T13:38:40.784103104Z caller=bucket.go:688 level=info msg="loaded new block" elapsed=261.655268ms id=01H8GGHPJEQJ5XWESMV004DW8J
Aug 28 15:38:40 440019-prod-observer01 thanos[75687]: concurrent map iteration and map write

@mateuszdrab

Works for me now, I can check memory metrics later.

@MichaHoffmann
Contributor

#6675

@antikilahdjs

Works perfectly. Thank you guys for working hard to deliver the best to us.

@MichaHoffmann
Contributor

@chalut01 can we close the issue?

@chalut01
Author

Fixed in v0.32.2!
Thank you everyone for your hard work.
