Shard request cache and script queries/aggregations #49321

Closed
AlexP-Elastic opened this issue Nov 19, 2019 · 8 comments
Labels
:Analytics/Aggregations Aggregations >enhancement :Search/Search Search-related issues that do not fall into other categories

Comments

@AlexP-Elastic

Support for caching queries including scripts:

Although the documentation for the shard request cache currently says:

If your query uses a script whose result is not deterministic (e.g. it uses a random function or references the current time) you should set the request_cache flag to false to disable caching for that request

in practice the cache is skipped whenever ScriptService is used.

This is intentional, per @jimczi:

this is intentional ... we cannot ensure that the result is deterministic.

An alternative (which, per the docs, seems consistent with how some other scenarios are handled) would be to skip the cache by default in such cases, but let clients pass the existing request_cache=true param to declare that their script is deterministic and can be cached.

Note that scripted aggregations are often very expensive and therefore great candidates to be cached!
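
The proposal above can be sketched as a small decision function. This is only a model of the behaviour suggested in this issue, not Elasticsearch code: `should_use_request_cache`, `uses_script`, and `index_default` are hypothetical names, and only the `request_cache` parameter comes from the docs quoted above.

```python
from typing import Optional


def should_use_request_cache(
    uses_script: bool,
    request_cache: Optional[bool] = None,
    index_default: bool = True,
) -> bool:
    """Model of the proposed policy (hypothetical helper, not an Elasticsearch API).

    request_cache mirrors the per-request `request_cache` parameter;
    None means the client did not set it.
    """
    if request_cache is False:
        # An explicit opt-out always wins.
        return False
    if uses_script:
        # Scripts skip the cache by default; only an explicit
        # request_cache=true declares the script deterministic.
        return request_cache is True
    # No script involved: explicit opt-in, else fall back to the index default.
    return True if request_cache is True else index_default
```

Under this sketch a scripted request is cached only when the client opts in, so the default stays as conservative as the current behaviour.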

@mayya-sharipova mayya-sharipova added the :Search/Search Search-related issues that do not fall into other categories label Nov 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@jpountz
Contributor

jpountz commented Nov 19, 2019

I see scripted queries/aggs as a way to trade performance for flexibility, as they let you do things that had not been planned at index time. Since these already trade performance for something else, it doesn't feel right to me to now trade correctness for performance by enabling caching when the user declares it is safe.

Maybe tell us more about your usage of scripts? I wonder whether you might be using scripts as a workaround for a missing aggregation feature.

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@AlexP-Elastic
Author

I'm personally a heavy script user and my usage patterns certainly shouldn't be taken as representative :) but for the purposes of discussion, my uses of scripts include:

  • Formatting and transforming fields in Kibana using the script field functionality
    • (unclear to what extent cache is needed for this scenario ... eg if I create a visualization and share the link, a cache is one way of handling the resulting spike? Is the shard request the right cache for that?)
  • Similarly, I use a spreadsheet connector (https://github.com/Alex-At-Home/elasticsearch-sheets) which lets (/encourages!) you create script fields and scripts for queries and aggregations (and build quite complex transforms between the source data and the spreadsheet's cell range using the scripted_metric aggregation)
    • (obviously a random app I built isn't evidence of any requirement though! The case of caching would be similar to the Kibana one, ie sharing a link to lots of people)
  • An aggregation I use somewhat commonly involves having a fairly frequently changing (or user entered) table of weights, and then using that lookup table to weight the results of a terms aggregation
    • (this is actually the thing whose performance I was experimenting with when I came upon the out-of-date documentation and started asking around)

So it could be summarized as a mix of "missing aggregation features", (related) "trading off performance to provide (query-time) flexibility", and to a lesser extent "trading off performance to keep all logic in one place".

In all cases I'm not so much trading off "correctness for performance" with the cache as trading off memory for performance (based on the knowledge/expectation that there will be a large number of queries with the same results in a given time period).
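
To make the memory-for-performance trade concrete, here is a toy request cache. It is a sketch under stated assumptions, not the Elasticsearch implementation: the class name, the `generation` counter, and keying on a hash of the request body are all illustrative choices.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class RequestCacheSketch:
    """Toy request cache keyed on (data generation, hash of request body)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, str], Any] = {}
        self.generation = 0  # bumped whenever the underlying data changes
        self.misses = 0

    def refresh(self) -> None:
        # Data changed: old entries can never be hit again, so drop them.
        self.generation += 1
        self._entries.clear()

    def search(self, request: dict, compute: Callable[[dict], Any]) -> Any:
        body = json.dumps(request, sort_keys=True).encode("utf-8")
        key = (self.generation, hashlib.sha256(body).hexdigest())
        if key not in self._entries:
            # Only the first identical request pays for the expensive work.
            self.misses += 1
            self._entries[key] = compute(request)
        return self._entries[key]
```

If a shared link produces many identical requests between two data changes, only the first one runs `compute`; the rest are served from memory.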

@jpountz
Contributor

jpountz commented Nov 19, 2019

Thanks for the detailed reply!

eg if I create a visualization and share the link, a cache is one way of handling the resulting spike? Is the shard request the right cache for that?

This is exactly the reason why we have this cache. :)

a fairly frequently changing (or user entered) table of weights, and then using that lookup table to weight the results of a terms aggregation

That one sounds interesting to me. Do I understand correctly that instead of sorting terms by doc_count descending, you want to sort them by descending weight? Or maybe even descending weight*doc_count? Can you tell me more about the higher-level use-case, is it something like a rollup?

To be clear I'm not against caching scripted queries or aggs, but I'm worried about allowing users to cache data that is not cacheable. My preferred way of fixing this would be by enabling Painless to tell us when a script is deterministic or not, so that we could make caching decisions accordingly. @jdconrad @stu-elastic Do you think it'd be doable?

@polyfractal
Contributor

An aggregation I use somewhat commonly involves having a fairly frequently changing (or user entered) table of weights, and then using that lookup table to weight the results of a terms aggregation

This caught my eye as well, would love to know more. We've talked about making bucket_sort scriptable, which would allow a lot more custom sorting of agg buckets. I realize that's still using a script, but being a post-processing step it'd also be a lot faster since it would only invoke the script once.

(although it would have different semantics since it's only sorting the final list of buckets, instead of all the buckets at runtime).
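
The difference in semantics can be illustrated with a small sketch. It assumes the ordering is `weight * doc_count`, which is only one of the possibilities raised in this thread; the function names and the neutral default weight are hypothetical.

```python
from typing import Dict, List, Tuple

Bucket = Tuple[str, int]  # (term, doc_count)


def weighted_score(bucket: Bucket, weights: Dict[str, float]) -> float:
    term, doc_count = bucket
    # Terms missing from the lookup table get a neutral weight of 1.0 (assumption).
    return weights.get(term, 1.0) * doc_count


def sort_all_then_truncate(
    buckets: List[Bucket], weights: Dict[str, float], size: int
) -> List[str]:
    """Rank every bucket by weight * doc_count, then keep the top `size`."""
    ranked = sorted(buckets, key=lambda b: weighted_score(b, weights), reverse=True)
    return [term for term, _ in ranked[:size]]


def truncate_then_sort(
    buckets: List[Bucket], weights: Dict[str, float], size: int
) -> List[str]:
    """Keep the top `size` buckets by doc_count first, then re-sort only those."""
    top = sorted(buckets, key=lambda b: b[1], reverse=True)[:size]
    ranked = sorted(top, key=lambda b: weighted_score(b, weights), reverse=True)
    return [term for term, _ in ranked]


buckets = [("a", 100), ("b", 80), ("c", 10)]
weights = {"a": 1.0, "b": 0.5, "c": 20.0}
print(sort_all_then_truncate(buckets, weights, 2))  # ['c', 'a']
print(truncate_then_sort(buckets, weights, 2))      # ['a', 'b']
```

A rare but heavily weighted term such as `c` is lost when sorting happens only as a post-processing step on the final bucket list.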

@jdconrad
Contributor

jdconrad commented Nov 19, 2019

So, I think Painless could tell us which scripts are deterministic, but I don't think that would be all that useful unless we can safely assume that anything read from docvalues (or _source, or whatever else the user accesses) is flushed from the cache when it changes. And if anything comes from user-defined params (are the weights passed this way, or is a new script created every time with constants?) then it's also not deterministic, as we explicitly expect those to change throughout a script's life.

The other thing is right now Painless isn't really aware of something like doc and just views this input as a simple Map. We would need to specialize certain inputs to be known as deterministic.

Edit: After thinking about this I realized that all those values are deterministic because otherwise the cache wouldn't work. (Oops.) I think Painless only has one non-deterministic method right now, randomUUID.
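
One way to picture such a check is a deny-list of non-deterministic methods. The sketch below is purely illustrative: it scans the script text with a regex, whereas a real check would presumably happen inside the Painless compiler on resolved method calls. The deny-list holds only `randomUUID`, the one method named above; all function names are hypothetical.

```python
import re
from typing import Set

# randomUUID is the only non-deterministic method named in this thread;
# a real list would have to come from Painless itself (assumption).
NON_DETERMINISTIC_METHODS: Set[str] = {"randomUUID"}


def called_methods(script_source: str) -> Set[str]:
    """Very rough extraction of call names: identifiers followed by '('."""
    return set(re.findall(r"([A-Za-z_]\w*)\s*\(", script_source))


def is_deterministic(
    script_source: str, deny_list: Set[str] = NON_DETERMINISTIC_METHODS
) -> bool:
    """True when the script calls nothing on the deny-list."""
    return called_methods(script_source).isdisjoint(deny_list)
```

With this in place, a caching decision could be driven by `is_deterministic(...)` instead of by whether a script is present at all.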

@stu-elastic
Contributor

Fixed by the following changes:

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020