Better defaults for DocumentSubsetBitsetCache #49260
Pinging @elastic/es-security (:Security/Authorization)
Broadly, there are two options for cache sizing:

1. Small maximum size with a long TTL
2. Large maximum size with a short TTL

(Small with a short TTL is also an option, but is not particularly helpful, and large with a long TTL just ends up holding too much memory for too long.) When adding this cache, I opted for approach (1) based on the premise that some users/roles would only be used a few times a week. I think it makes sense to switch the defaults to large-weight + short-TTL, because that matches the use cases where the cache is most needed.
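The trade-off between the two sizing strategies can be illustrated with a minimal cache that is bounded both by total weight and by entry age. This is a hypothetical sketch (names like `SizeTtlCache` are invented for illustration), not the actual `DocumentSubsetBitsetCache` implementation, which builds on Elasticsearch's own cache infrastructure:

```java
import java.util.LinkedHashMap;

// Minimal sketch of a cache bounded by total weight (bytes) and by a TTL.
// "Large weight + short TTL" means maxWeight is generous but entries are
// dropped quickly; "small weight + long TTL" is the opposite trade-off.
final class SizeTtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long weight;
        final long insertedAtMillis;
        Entry(V value, long weight, long insertedAtMillis) {
            this.value = value;
            this.weight = weight;
            this.insertedAtMillis = insertedAtMillis;
        }
    }

    private final long maxWeight;
    private final long ttlMillis;
    private long currentWeight = 0;
    // Access-order map gives a simple LRU iteration order for eviction.
    private final LinkedHashMap<K, Entry<V>> map = new LinkedHashMap<>(16, 0.75f, true);

    SizeTtlCache(long maxWeight, long ttlMillis) {
        this.maxWeight = maxWeight;
        this.ttlMillis = ttlMillis;
    }

    synchronized void put(K key, V value, long weight, long nowMillis) {
        Entry<V> old = map.remove(key);
        if (old != null) currentWeight -= old.weight;
        // A single entry larger than the whole cache cannot be stored
        // (the real cache logs a warning in this situation).
        if (weight > maxWeight) return;
        map.put(key, new Entry<>(value, weight, nowMillis));
        currentWeight += weight;
        // Evict least-recently-used entries until back under the weight limit.
        var it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            currentWeight -= it.next().getValue().weight;
            it.remove();
        }
    }

    synchronized V get(K key, long nowMillis) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (nowMillis - e.insertedAtMillis > ttlMillis) { // expired: drop it
            map.remove(key);
            currentWeight -= e.weight;
            return null;
        }
        return e.value;
    }
}
```

With a large `maxWeight`, expensive-to-build entries survive long enough to be reused; the short `ttlMillis` then ensures rarely used entries do not pin memory indefinitely.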
+1
The Document Level Security BitSet Cache (see #43669) had a default configuration of "small size, long lifetime". However, this is not a very useful default as the cache is most valuable for BitSets that take a long time to construct, which is (generally speaking) the same ones that operate over a large number of documents and contain many bytes. This commit changes the cache to be "large size, short lifetime" so that it can hold bitsets representing billions of documents, but releases memory quickly. The new defaults are 10% of heap, and 2 hours. This also adds some logging when a single BitSet exceeds the size of the cache and when the cache is full. Resolves: #49260
Backport of: elastic#50535
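The resulting defaults can be overridden per node in `elasticsearch.yml`. The setting names below match the documented Elasticsearch DLS settings; the values are illustrative, not recommendations:

```yaml
# Maximum memory the DLS bitset cache may use on this node
# (default after this change: 10% of the JVM heap).
xpack.security.dls.bitset.cache.size: 200mb

# Time-to-live for cached bitsets (default after this change: 2h).
xpack.security.dls.bitset.cache.ttl: 2h
```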
Initially the DLS cache was unbounded. This was not due to carelessness, but because Lucene assumes that accessing live docs is cheap, so we didn't want it to run in linear time on a cache miss. Unfortunately, the ability to use templated role queries means you can have an unbounded number of bitsets in the cache, so we had to roll back that decision and move to a size-bounded cache in order to avoid out-of-memory errors. We got some feedback that this has considerably slowed down some clusters, and that the default of `50MB` that this cache may use is way too low. We should probably move to a number that is more generous, and that also depends on the available memory on the node, e.g. `10%`?

I wonder whether we should also decrease the default TTL. The TTL is important on this cache because it helps avoid keeping the cache full all the time when some role queries are rarely used, which is probably typical when templated role queries are used. But the current default of 1 week means that someone running only one query every week would always use memory for this cache that might be better spent elsewhere. What about decreasing this default to maybe a couple of hours?
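A heap-relative default like the `10%` suggested above could be derived in the JVM as follows. This is a hypothetical sketch (`HeapPercentage` and `percentOfHeap` are invented names); Elasticsearch has its own settings infrastructure for memory-size values, and this is not it:

```java
// Sketch: computing a cache-size default as a percentage of the JVM heap.
final class HeapPercentage {
    static long percentOfHeap(int percent) {
        long maxHeap = Runtime.getRuntime().maxMemory(); // effectively -Xmx, in bytes
        // Divide before multiplying to avoid overflow on very large heaps.
        return maxHeap / 100 * percent;
    }
}
```

Tying the limit to the heap means nodes with more memory automatically get a larger cache, instead of everyone sharing one fixed `50MB` ceiling.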