-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Transform] Align transform checkpoint range with date_histogram interval for better performance #74004
[Transform] Align transform checkpoint range with date_histogram interval for better performance #74004
Conversation
48634ef
to
944fea1
Compare
6a65d22
to
dc3a876
Compare
Pinging @elastic/ml-core (Team:ML) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make this an optional feature and add a setting (hello, naming discussion ;-) ). The default should be to write incomplete/interim buckets as we do today. We might identify some cases where we always want to align to bucket boundaries, e.g. if frequency and bucket boundaries are small (e.g for fixed intervals with less than 60s
it might make sense to align per default). To be discussed, we should give users some guidance about when this is useful.
IMO batch transforms should stay as they are, the assumption here is static data and if the user wants it bounded, he can specify a query.
.../src/main/java/org/elasticsearch/xpack/transform/checkpoint/TimeBasedCheckpointProvider.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be discussed, we should give users some guidance about when this is useful.
I had an impression that we want to free the user from the responsibility of enabling this optimization themselves.
Do you envision situations where this optimization is harmful for the user to the point they want to disable it?
Or is it more of a safety measure so that the user has a way out if the optimization doesn't work for them for any unforseen reason?
IMO batch transforms should stay as they are, the assumption here is static data and if the user wants it bounded, he can specify a query.
Ok, makes sense.
.../src/main/java/org/elasticsearch/xpack/transform/checkpoint/TimeBasedCheckpointProvider.java
Outdated
Show resolved
Hide resolved
dc3a876
to
dda0983
Compare
5bd2af5
to
6491904
Compare
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
Outdated
Show resolved
Hide resolved
Done. |
4846ae9
to
785a706
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added 2 comments / questions,
I have a suspicion:
The checkpoint provider is passed in as argument to the indexer and it's also final
. That means changing the setting via _update
won't be applied, but you have to stop and start the transform to force a reload of the checkpoint provider.
If I am right, this is an existing bug. However we could leave it aside for this PR, some re-factoring is probably required. Nevertheless I think we should fix it for this release cycle.
Map.Entry<String, SingleGroupSource> topLevelGroupEntry = groups.entrySet().iterator().next(); | ||
if ((topLevelGroupEntry.getValue() instanceof DateHistogramGroupSource) == false) { | ||
return timestamp; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What the reason for requiring the top grouping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under impression that the date histogram grouping must be the first (top) specified in order to split all data into time buckets properly.
If it's not first then some other buckets (e.g.: terms) that are first will make the date histogram buckets mixed.
Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, nevermind.
I've changed it so that it searches for the first date histogram source that matches on field name.
.../src/main/java/org/elasticsearch/xpack/transform/checkpoint/TimeBasedCheckpointProvider.java
Outdated
Show resolved
Hide resolved
785a706
to
c8c34d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I read the PR description Align transform checkpoint range with date_histogram interval for better performance
I understand what that means but I'm not sure what interim_results
means in this context. Perhaps write_partial | incomplete | interim_buckets
or align_checkpoints
or just make apply the optimisation automatically and remove the config option.
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
Outdated
Show resolved
Hide resolved
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
Outdated
Show resolved
Hide resolved
5a605ea
to
721d6cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I read the PR description
Align transform checkpoint range with date_histogram interval for better performance
I understand what that means but I'm not sure whatinterim_results
means in this context. Perhapswrite_partial | incomplete | interim_buckets
oralign_checkpoints
interim_results
is the setting name we came up with during the discussion although I agree it is not perfect. We can of course revisit that.
or just make apply the optimisation automatically and remove the config option.
This is the option we discarded with the assumption there will be users that don't want the optimization for any reason.
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
Outdated
Show resolved
Hide resolved
...t-high-level/src/main/java/org/elasticsearch/client/transform/transforms/SettingsConfig.java
Outdated
Show resolved
Hide resolved
Map.Entry<String, SingleGroupSource> topLevelGroupEntry = groups.entrySet().iterator().next(); | ||
if ((topLevelGroupEntry.getValue() instanceof DateHistogramGroupSource) == false) { | ||
return timestamp; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, nevermind.
I've changed it so that it searches for the first date histogram source that matches on field name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
4cc0747
to
2d788ae
Compare
run elasticsearch-ci/bwc |
1 similar comment
run elasticsearch-ci/bwc |
4cca62f
to
a575b77
Compare
I'm merging this with |
* master: (868 commits) Query API key - Rest spec and yaml tests (elastic#76238) Delay shard reassignment from nodes which are known to be restarting (elastic#75606) Reenable bwc tests for elastic#76475 (elastic#76576) Set version to 7.15 in BWC code (elastic#76577) Don't remove warning headers on all failure (elastic#76434) Disable bwc tests for elastic#76475 (elastic#76541) Re-enable bwc tests (elastic#76567) Keep track of data recovered from snapshots in RecoveryState (elastic#76499) [Transform] Align transform checkpoint range with date_histogram interval for better performance (elastic#74004) EQL: Remove "wildcard" function (elastic#76099) Fix 'accept' and 'content_type' fields for search_mvt API Add persistent licensed feature tracking (elastic#76476) Add system data streams to feature state snapshots (elastic#75902) fix the error message for instance methods that don't exist (elastic#76512) ILM: Add validation of the number_of_shards parameter in Shrink Action of ILM (elastic#74219) remove dashboard only reserved role (elastic#76507) Fix Stack Overflow in UnassignedInfo in Corner Case (elastic#76480) Add (Extended)KeyUsage KeyUsage, CipherSuite & Protocol to SSL diagnostics (elastic#65634) Add recovery from snapshot to tests (elastic#76535) Reenable BwC Tests after elastic#76532 (elastic#76534) ...
This PR makes the checkpoint boundaries aligned with top-level
date_histogram
bucket boundaries.The optimization is applied only when the transform config has
date_histogram
as the first group in thegroup_by
list.Relates #62746