auto-interval date histogram - 6.x backport #32107

pcsanwald · 2018-07-16T19:03:39Z

6.x backport of #28993. Opening a PR to ensure CI is green.

* Adds a new auto-interval date histogram

  This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents in buckets at second interval; when it has created more than the target number of buckets, it merges these buckets into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It will keep merging buckets when it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

  This aggregation intentionally does not support min_doc_count, offset and extended_bounds to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary.

  Closes elastic#9572
* Adds documentation
* Added sub aggregator test
* Fixes failing docs test
* Brings branch up to date with master changes
* trying to get tests to pass again
* Fixes multiBucketConsumer accounting
* Collects more buckets than needed on shards

  This gives us more options at reduce time in terms of how we do the final merge of the buckets to produce the final result
* Revert "Collects more buckets than needed on shards"

  This reverts commit 993c782.
* Adds ability to merge within a rounding
* Fixes non-timezone doc test failure
* Fix time zone tests
* iterates on tests
* Adds test case and documentation changes

  Added some notes in the documentation about the intervals that can be returned. Also added a test case that utilises the merging of consecutive buckets
* Fixes performance bug

  The bug meant that getAppropriate rounding took a huge amount of time if the range of the data was large but also sparsely populated. In these situations the rounding would be very low, so iterating through the rounding values from the min key to the max key took a long time (~120 seconds in one test). The solution is to add a rough estimate first, which chooses the rounding based just on the long values of the min and max keys alone but selects the rounding one lower than the one it thinks is appropriate, so the accurate method can choose the final rounding taking into account the fact that intervals are not always fixed length. The commit also adds more tests
* Changes to only do complex reduction on final reduce
* merge latest with master
* correct tests and add a new test case for 10k buckets
* refactor to perform bucket number check in innerBuild
* correctly derive bucket setting, update tests to increase bucket threshold
* fix checkstyle
* address code review comments
* add documentation for default buckets
* fix typo
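
For readers who want a feel for the collect-then-merge behaviour described in the commit message above, here is a minimal sketch. It is not the code from this PR: the class `AutoIntervalSketch`, its methods, and the fixed-length interval ladder (seconds, minutes, hours, days) are all assumptions made for illustration. The real aggregation continues up to years, handles intervals that are not fixed length, and inserts empty buckets; none of that is modelled here.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; names and structure are not taken from the PR.
final class AutoIntervalSketch {
    // Simplified, fixed-length rounding ladder in milliseconds:
    // second, minute, hour, day. (The real ladder goes up to years.)
    static final long[] INTERVALS_MS = {1_000L, 60_000L, 3_600_000L, 86_400_000L};

    private final int targetBuckets;
    private int roundingIdx = 0;                            // start at second interval
    private TreeMap<Long, Long> buckets = new TreeMap<>();  // bucket key -> doc count

    AutoIntervalSketch(int targetBuckets) {
        this.targetBuckets = targetBuckets;
    }

    // Round a timestamp down to the start of its interval.
    static long round(long timestampMs, long intervalMs) {
        return Math.floorDiv(timestampMs, intervalMs) * intervalMs;
    }

    // Collect one document; if the bucket count exceeds the target,
    // move to the next coarser interval and merge the existing buckets.
    void collect(long timestampMs) {
        buckets.merge(round(timestampMs, INTERVALS_MS[roundingIdx]), 1L, Long::sum);
        while (buckets.size() > targetBuckets && roundingIdx < INTERVALS_MS.length - 1) {
            roundingIdx++;
            TreeMap<Long, Long> merged = new TreeMap<>();
            for (Map.Entry<Long, Long> e : buckets.entrySet()) {
                merged.merge(round(e.getKey(), INTERVALS_MS[roundingIdx]), e.getValue(), Long::sum);
            }
            buckets = merged;
        }
    }

    long intervalMs() {
        return INTERVALS_MS[roundingIdx];
    }

    Map<Long, Long> buckets() {
        return buckets;
    }

    public static void main(String[] args) {
        AutoIntervalSketch sketch = new AutoIntervalSketch(3);
        for (long t = 0; t < 5_000L; t += 1_000L) {
            sketch.collect(t);
        }
        System.out.println(sketch.intervalMs() + " ms buckets: " + sketch.buckets());
    }
}
```

With a target of 3, the fourth distinct second pushes the bucket count over the target, so the sketch switches to minute-interval buckets and sums the counts of the merged buckets.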

elasticmachine · 2018-07-16T19:03:43Z

Pinging @elastic/es-search-aggs

pcsanwald · 2018-07-17T23:06:06Z

.../elasticsearch/search/aggregations/bucket/histogram/AutoDateHistogramAggregationBuilder.java

+            .reduce(Integer::max).get();
+        Settings settings = context.getQueryShardContext().getIndexSettings().getNodeSettings();
+        int maxBuckets = MultiBucketConsumerService.MAX_BUCKET_SETTING.get(settings);
+        if (maxBuckets >= 0) {


@colings86 I added this check for 6.x, as the search.max_buckets parameter is optional. I removed the test, but if we want to keep it, I could update the build.gradle for the integration tests with the cluster setting for search.max_buckets. WDYT?

also: it seems like in 7.x we're enforcing a soft limit by default, but should I port this fix over for the case where the user sets the default to -1?

hmm I see. A few thoughts on this:

> I added this check for 6.x, as the search.max_buckets parameter is optional.

From looking at the 6.x version of MultiBucketConsumerService, it seems you are right that MultiBucketConsumerService.MAX_BUCKET_SETTING is optional (defaults to -1), but we have some logic in there so that if the setting is -1 and the number of buckets used is over 10,000, we log a deprecation warning stating that the request will fail from 7.0 onwards. I wonder if we should have similar logic here, so that if maxBuckets < 0 && numBuckets > (MultiBucketConsumerService.SOFT_LIMIT_MAX_BUCKETS / maxRoundingInterval) we log a deprecation warning that in 7.0+ the request will fail?

> I could update the build.gradle for the integration tests with the cluster setting for search.max_buckets. WDYT?

I think we should not do this. The tests should use "out of the box" defaults unless they are specifically testing a cluster setting's behaviour IMO.

> also: it seems like in 7.x we're enforcing a soft limit by default, but should I port this fix over for the case where the user sets the default to -1?

I don't think this will be necessary, since it appears that in master we set the minimum allowed value of the limit to 0, so it should never need to deal with a value of -1 here.
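
The check discussed in this thread can be sketched roughly as follows. This is an illustrative stand-alone version, not the code in the PR: the class `BucketLimitSketch`, the `Outcome` enum, the method signature, and the exact comparison against the ceiling are assumptions. The value 10,000 for the soft limit comes from the comment above; the real code reads `search.max_buckets` from node settings and logs through a deprecation logger rather than returning a value.

```java
// Hypothetical illustration only; names and structure are not taken from the PR.
final class BucketLimitSketch {
    // Soft limit mentioned in the review comment (10,000 buckets).
    static final int SOFT_LIMIT_MAX_BUCKETS = 10_000;

    enum Outcome { OK, DEPRECATION_WARNING }

    /**
     * @param numBuckets          target number of buckets requested by the user
     * @param maxBuckets          value of search.max_buckets; -1 means "not set" on 6.x
     * @param maxRoundingInterval largest inner interval of the rounding ladder (assumed > 0)
     */
    static Outcome check(int numBuckets, int maxBuckets, int maxRoundingInterval) {
        if (maxBuckets >= 0) {
            // A hard limit is configured: reject requests above the derived ceiling.
            int bucketCeiling = maxBuckets / maxRoundingInterval;
            if (numBuckets > bucketCeiling) {
                throw new IllegalArgumentException(
                    "number of buckets must be less than " + bucketCeiling);
            }
            return Outcome.OK;
        } else if (numBuckets > SOFT_LIMIT_MAX_BUCKETS / maxRoundingInterval) {
            // No hard limit configured: the request still succeeds on 6.x,
            // but would fail once the limit is enforced by default.
            return Outcome.DEPRECATION_WARNING;
        }
        return Outcome.OK;
    }

    public static void main(String[] args) {
        System.out.println(check(10, -1, 30));
        System.out.println(check(400, -1, 30));
    }
}
```

Returning an enum instead of logging keeps the sketch easy to test; a caller would map `DEPRECATION_WARNING` to whatever logging mechanism it uses.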

pcsanwald · 2018-07-18T20:41:24Z

.../elasticsearch/search/aggregations/bucket/histogram/AutoDateHistogramAggregationBuilder.java

+                    " must be less than " + bucketCeiling);
+            }
+        } else if ( numBuckets > (MultiBucketConsumerService.SOFT_LIMIT_MAX_BUCKETS / maxRoundingInterval)) {
+            DEPRECATION_LOGGER.deprecated("This request will fail in 7.x, because number of buckets, " +


@colings86 I've added a deprecation warning here.

👍 I think we should add a test that checks for this deprecation warning and then I think this back port is good to go

pcsanwald · 2018-07-29T23:22:46Z

@colings86 this one is ready for final review

colings86

LGTM

colings86 and others added 3 commits July 16, 2018 14:40

fix typo

525b846

fix import

9a8481f

pcsanwald added :Analytics/Aggregations Aggregations v6.4.0 labels Jul 16, 2018

Paul Sanwald added 4 commits July 16, 2018 20:13

fix compilation error

8b01423

fix checkstyle

74d66d3

remove test case that assumes soft limit for bucket

6021a31

remove superfluous line

33e5ca3

pcsanwald commented Jul 17, 2018

add deprecation warning and fix checkstyle

1aa4ee6

pcsanwald commented Jul 18, 2018

colings86 mentioned this pull request Jul 22, 2018

[DOCS] Adds release highlights for search for 6.4 #32095

Merged

Paul Sanwald added 3 commits July 25, 2018 08:36

add a test case for deprecation warning

780f4f0

Merge branch '6.x' into auto-interval-histo-6.x

26d4647

Merge branch '6.x' into auto-interval-histo-6.x

a235be4

colings86 approved these changes Jul 30, 2018

pcsanwald merged commit 7e1a1fe into elastic:6.x Jul 30, 2018

pcsanwald deleted the auto-interval-histo-6.x branch July 30, 2018 12:45

lcawl added the >non-issue label Aug 20, 2018
