Aggs: clean up date_histogram
#9062
Comments
+1 |
Since I just changed the "pre_/post_offset" in #9417 for the histogram I can also look into this. I changed the date_histogram to only use the simpler "offset" option, which only allows shifting the bucket start and end points. Just to make sure I understand the way offset should work here: if I have two docs, one at 5am and one at 7am on Feb 3, and I use a daily bucket with "offset" : "6h", then the first doc should go into the Feb 2nd bucket, keyed with "2015-02-02T06:00:00", and the second into the Feb 3rd bucket. As for the changes in 'pre_zone' and 'post_zone': as far as I understand it so far, 'post_zone' only affects the 'valueForKey' of the underlying Roundings, so it would probably be sufficient to rename 'pre_zone' to 'zone' and drop the other parameter. Or am I missing something here? @jpountz I could work on this further, adapt the existing tests etc. I would need some advice on what to consider in terms of bwc though, because dropping the parameters will affect serialization. Not sure how much of an issue this is when this only goes to v2.0.0. |
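To make the offset question above concrete, here is a minimal sketch of the arithmetic, assuming "offset" simply shifts fixed-size bucket boundaries on the UTC timeline. The class `OffsetBucketSketch` and the method `bucketKey` are hypothetical names used for illustration only; this is not the Elasticsearch implementation.

```java
import java.time.Duration;
import java.time.Instant;

public class OffsetBucketSketch {

    // Hypothetical helper (not Elasticsearch code): floor a UTC timestamp to a
    // fixed-size bucket whose boundaries are shifted by `offset`.
    public static Instant bucketKey(Instant value, Duration interval, Duration offset) {
        long intervalMs = interval.toMillis();
        long offsetMs = offset.toMillis();
        // Undo the shift, floor to the interval, then re-apply the shift.
        long shifted = value.toEpochMilli() - offsetMs;
        long floored = Math.floorDiv(shifted, intervalMs) * intervalMs;
        return Instant.ofEpochMilli(floored + offsetMs);
    }

    public static void main(String[] args) {
        Duration day = Duration.ofDays(1);
        Duration sixHours = Duration.ofHours(6);
        // Doc at 5am on Feb 3 falls before the shifted boundary.
        System.out.println(bucketKey(Instant.parse("2015-02-03T05:00:00Z"), day, sixHours));
        // prints 2015-02-02T06:00:00Z
        // Doc at 7am on Feb 3 falls after it.
        System.out.println(bucketKey(Instant.parse("2015-02-03T07:00:00Z"), day, sixHours));
        // prints 2015-02-03T06:00:00Z
    }
}
```

Under this assumption the 5am doc lands in the bucket keyed 2015-02-02T06:00:00 and the 7am doc in the one keyed 2015-02-03T06:00:00, matching the expectation stated in the comment.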
Yes.
Elasticsearch should always return UTC dates so I think we need both? The time zone parameter should essentially behave the same way as the If we keep this
For backward compatibility I think we need to do the following:
@clintongormley Would be great if you could confirm that what I wrote above looks good to you. |
Working on this, I am starting to see why the current options are a bit confusing. Removing the "pre/post offset" part seems fairly straightforward, but for the pre/postZone deletion I had trouble getting all the tests to run after my modifications to the TimeZoneRounding classes. When I finally got my tests running I stumbled into a test failure in IndicesQueryCacheTests, which I first thought was due to my code changes, but digging into it I found I can also reproduce the failure on master. There is an assertion that checks the keys of subsequent buckets. When using a postZone(-1) in IndicesQueryCacheTests this can be made to fail. I added examples and an isolated test for this in the "test-timeZoneRounding" branch in my repo at https://github.com/cbuescher/elasticsearch/compare/test-timeZoneRounding @jpountz should I open a separate issue for this? |
Haven't dug yet but your description of the issue looks similar to #7673 |
Just to clarify a few things I got really confused about when digging deeper into the current state of TimeZoneRoundings: currently the pre_zone offset is applied in roundKey() and postZone in valueForKey(). Would it be possible to just do all the zone conversion in roundKey() and return that in UTC after the cleanup? Apart from being confusing, the current solution leads to things like roundKey() not being an idempotent operation when preZone!=0. Also, should one safely assume that the input and output values of the rounding methods are always UTC after the cleanup? Currently this seems not to be the case, since the values returned by roundKey() and valueForKey() can both be in local time zones. |
The reason why it works this way is that these rounding methods are called in HistogramCollector.collect which is a typical bottleneck when running histogram aggregations. By splitting the rounding logic into roundKey and valueForKey, we still call roundKey for every value (typically millions of times), but valueForKey only once for every unique roundKey (typically only a couple tens). I don't know if roundKey should be idempotent, but X -> valueForKey(roundKey(X)) certainly should. That said, we should favour correctness over performance so if this distinction makes things harder to reason about, let's disable it for now (eg. by putting all the logic into roundKey and making valueForKey return the provided argument).
The return value of roundKey is really an identifier of a bucket, it doesn't carry any meaning and has no notion of timezone. However, indeed, one can safely assume that the input of roundKey is a UTC date, and the output of valueForKey should be a UTC date as well. I know the current state of timezone handling is a bit messy so if it makes things easier for you, feel free to break this issue into several smaller issues/pull requests. |
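The split described above can be sketched as follows. This is a simplified illustration, not the actual Rounding or HistogramCollector code: the class `TwoStepRoundingSketch`, the fixed daily interval, and the `histogram` method are all assumptions made for the example. The point is only that `roundKey` runs once per value while `valueForKey` runs once per unique key.

```java
import java.util.Map;
import java.util.TreeMap;

public class TwoStepRoundingSketch {

    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Cheap step, called for every collected value. The result is an opaque
    // bucket identifier with no time zone meaning of its own.
    public static long roundKey(long utcMillis) {
        return Math.floorDiv(utcMillis, DAY_MS);
    }

    // Called once per unique bucket identifier; returns the bucket start as a UTC date.
    public static long valueForKey(long key) {
        return key * DAY_MS;
    }

    // Hypothetical collector loop: count per opaque key, convert keys at the end.
    public static Map<Long, Long> histogram(long[] utcValues) {
        Map<Long, Long> countsByKey = new TreeMap<>();
        for (long value : utcValues) {
            countsByKey.merge(roundKey(value), 1L, Long::sum);
        }
        Map<Long, Long> countsByBucketStart = new TreeMap<>();
        for (Map.Entry<Long, Long> entry : countsByKey.entrySet()) {
            countsByBucketStart.put(valueForKey(entry.getKey()), entry.getValue());
        }
        return countsByBucketStart;
    }

    public static void main(String[] args) {
        long[] values = {0L, 1000L, DAY_MS + 5};
        // Two values fall in the first day, one in the second.
        System.out.println(histogram(values));
    }
}
```

In this sketch X -> valueForKey(roundKey(X)) is idempotent, which is the property called out above, even though roundKey on its own is not.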
Separating the removal of pre/postOffset from the whole time zone issue into two pull requests probably makes sense. I think that part is almost done; I will revise the tests and go over the documentation for that one tomorrow and can maybe issue a PR then. Also, merging this to 1.x separately would be good. |
From discussion this morning: also the 'pre_zone_adjust_large_interval' should be removed if possible. |
+1 |
…ffset' and 'post_offset' Add offset option to 'date_histogram' replacing and simplifying the previous 'pre_offset' and 'post_offset' options. This change is part of a larger clean up task for `date_histogram` from issue elastic#9062.
Opened a pull request for the 'offset' part of this task. Will do the changes needed for the deprecation of 'pre/postOffset' on the 1.x branch separately. |
Add offset option to 'date_histogram', deprecating the previous 'pre_offset' and 'post_offset' options. This change is part of a larger clean up task for `date_histogram` from issue elastic#9062.
Writing randomized tests I started wondering if the 'interval' setting should only be allowed to be positive. Negative values aren't supported at the moment, but if I issue a date_histogram request with a negative interval setting I seem to get an OOM. |
+1 on validating the interval |
Should I open a separate issue for a fix for that on 1.x as well? Using negative intervals I think I can reliably reproduce an OOM error. This can be rejected quite early, already in the REST call. |
Yes, please! |
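A minimal sketch of the kind of early check discussed above, assuming the interval has already been parsed to milliseconds. `IntervalValidationSketch` and `validateInterval` are hypothetical names; where exactly such a check belongs in the request parsing is not settled in this thread.

```java
public class IntervalValidationSketch {

    // Hypothetical guard: reject non-positive intervals before any buckets are built.
    public static long validateInterval(long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException(
                "[interval] must be a positive value, got: " + intervalMillis);
        }
        return intervalMillis;
    }

    public static void main(String[] args) {
        System.out.println(validateInterval(86_400_000L)); // a daily interval passes
        try {
            validateInterval(-1L);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```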
I think I have the time zone simplification in #9637. However there are a few potential simplifications that I would still like to discuss in that PR. |
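For readers following along, the goal of a single time zone parameter (compute buckets in that zone, return keys in UTC) can be sketched with plain `java.time`. This is an illustration under that assumption, not the code in the PR; `ZoneRoundingSketch` and `dayBucketStart` are hypothetical names.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

public class ZoneRoundingSketch {

    // Hypothetical helper: round a UTC instant down to the start of its local
    // day in `zone`, and return that boundary as a UTC instant again.
    public static Instant dayBucketStart(Instant value, ZoneId zone) {
        ZonedDateTime local = value.atZone(zone);
        return local.truncatedTo(ChronoUnit.DAYS).toInstant();
    }

    public static void main(String[] args) {
        ZoneId berlin = ZoneId.of("Europe/Berlin");
        // 23:30 UTC on Feb 3 is already 00:30 on Feb 4 in Berlin (UTC+01:00 in winter).
        System.out.println(dayBucketStart(Instant.parse("2015-02-03T23:30:00Z"), berlin));
        // prints 2015-02-03T23:00:00Z
    }
}
```

Both the input and the output are UTC instants here, and applying the function to its own output returns the same value.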
Something else that would be nice to do when we have a single time_zone parameter would be to use this time_zone to format bucket keys. Here is a quick code example that probably better explains what I mean:
FormatDateTimeFormatter formatter = DateFieldMapper.Defaults.DATE_TIME_FORMATTER;
// this is midnight in Berlin
DateTime dateTime = formatter.parser().parseDateTime("2015-02-03T23:00:00Z");
// What we are basically doing today:
System.out.println(formatter.printer().print(dateTime));
// prints 2015-02-03T23:00:00.000Z
// What we could do instead
System.out.println(formatter.printer().withZone(DateTimeZone.forID("Europe/Berlin")).print(dateTime));
// prints 2015-02-04T00:00:00.000+01:00
Like the post_zone parameter, this helps make buckets look like they still start at midnight, but in contrast to the post_zone parameter we did not move the date to a different time zone; we are only printing it in the desired time zone (hence the "+01:00" at the end). |
Printing bucket keys in local time zones is another great idea, maybe this should be configurable as well? I would prefer opening a separate issue from the current PR though. |
+1 on a different issue, I was just suggesting it here since I think it's quite tied to this issue. I tend to think of time zones as something which is quite complicated to get right (a bit like encodings), so the fewer parameters we have about time zones, the better I think? Maybe we could just architect the code so that adding this parameter would be straightforward to do in the future (if we ever decide to add it)? |
Just opened the above follow up issue for using the time zone formatting for bucket keys. |
The options for pre_zone and post_zone in date_histogram are going to be removed in 2.0 (see #9062) in favor of the already existing time_zone option. This commit deprecates those fields using ParseFields and adds a deprecation notice to the docs and the migration guide.
The `date_histogram` aggregation supports both `pre_zone` and `post_zone`. This is bad because specifying two different values for these parameters makes elasticsearch return dates that are not in UTC. Instead we should only support a `zone` parameter that is the time zone that we will use to compute the buckets, and return buckets in UTC.
Similarly we have `pre_offset` and `post_offset` while we should only support a single `offset` parameter that would allow to do things like: