duplicate series: deduplicate + fix runtime normalization #1855

Dieterbe · 2020-07-03T10:02:43Z

In this PR, I introduce a fairly trivial deduplication optimization, followed by a good chunk of developer documentation that clarifies our use of request types, where the sources of duplication and reuse are, and which opportunities for deduplication exist. That leads to a clearer, more precise explanation of input data reuse (which all boils down to the dataMap), and finally to the fix for bug #1807.

fix #1807

it's trivial to demonstrate that before this change, a query like target=foo&target=foo results in duplicates in plan.Reqs, which leads to duplicates in ReqMap during executePlan().
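To make the duplication concrete, here is a minimal sketch of order-preserving deduplication. The `Req` struct and `dedupReqs` function are hypothetical simplifications for illustration and are not the types or code used in expr/plan.go.

```go
package main

import "fmt"

// Req is a hypothetical, simplified stand-in for an entry in plan.Reqs.
// It only holds comparable fields, so it can be used directly as a map key.
type Req struct {
	Query string
	From  uint32
	To    uint32
}

// dedupReqs returns reqs with exact duplicates removed, keeping first-seen order.
func dedupReqs(reqs []Req) []Req {
	seen := make(map[Req]struct{}, len(reqs))
	out := make([]Req, 0, len(reqs))
	for _, r := range reqs {
		if _, ok := seen[r]; ok {
			continue // already requested, skip the duplicate
		}
		seen[r] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	// Roughly what target=foo&target=foo&target=bar would produce before deduplication.
	reqs := []Req{
		{Query: "foo", From: 0, To: 60},
		{Query: "foo", From: 0, To: 60},
		{Query: "bar", From: 0, To: 60},
	}
	fmt.Println(len(reqs), "->", len(dedupReqs(reqs)))
}
```

With duplicates removed at this stage, later stages that key on the request see each one only once.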

apply COW when altering points slice during runtime normalization fix #1807
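The copy-on-write idea can be sketched as follows. All names here (`Point`, `Series`, `sumEvery`, `normalizeCOW`) are hypothetical and heavily simplified; the sketch only assumes that two series handed to a function can share the same underlying points slice, which is the situation that an in-place rewrite would corrupt.

```go
package main

import "fmt"

// Point and Series are hypothetical, simplified types for illustration.
type Point struct {
	Val float64
	Ts  uint32
}

type Series struct {
	Target     string
	Interval   uint32
	Datapoints []Point
}

// sumEvery appends the sum of every group of n points from src to dst.
// A trailing partial group is dropped to keep the sketch short.
func sumEvery(dst, src []Point, n int) []Point {
	for i := 0; i+n <= len(src); i += n {
		p := Point{Ts: src[i+n-1].Ts}
		for _, q := range src[i : i+n] {
			p.Val += q.Val
		}
		dst = append(dst, p)
	}
	return dst
}

// normalizeCOW returns a consolidated copy of in. It writes into a freshly
// allocated slice, so in.Datapoints (which may be shared with other series)
// is never modified.
func normalizeCOW(in Series, n int) Series {
	out := in
	out.Interval = in.Interval * uint32(n)
	out.Datapoints = sumEvery(make([]Point, 0, len(in.Datapoints)/n), in.Datapoints, n)
	return out
}

func main() {
	shared := []Point{{Val: 1, Ts: 10}, {Val: 2, Ts: 20}, {Val: 3, Ts: 30}, {Val: 4, Ts: 40}}
	a := Series{Target: "foo", Interval: 10, Datapoints: shared}
	b := Series{Target: "foo", Interval: 10, Datapoints: shared}
	// Both inputs share one slice; each result is consolidated exactly once.
	fmt.Println(normalizeCOW(a, 2).Datapoints, normalizeCOW(b, 2).Datapoints, shared)
}
```

Had the consolidation written its output back into the shared slice, the second call would have consolidated already-consolidated data.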

Dieterbe · 2020-07-03T10:12:21Z

expr/pool.go

@@ -12,3 +16,17 @@ var pointSlicePool *sync.Pool
 func Pool(p *sync.Pool) {
 	pointSlicePool = p
 }
+
+// pointSlicePoolGet returns a pointslice of at least minCap capacity.
+// similar code lives also in api.Fix(). at some point we should really clean up our pool code.


Note that the pool is created in the api package and passed into the expr package.
The original idea was to make the expr library reusable in other software, hence the need to pass the pool into it.

But it's starting to make more sense to have one global pool singleton with a couple of methods, and to access it directly everywhere. That's out of scope for this PR, though.
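For readers unfamiliar with the pattern, a helper with the contract described in the diff comment could look roughly like the sketch below. Only the name `pointSlicePoolGet` and the "at least minCap capacity" contract come from the diff; the body, the `Point` type, and the pool's `New` function are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Point is a hypothetical stand-in for the real point type.
type Point struct {
	Val float64
	Ts  uint32
}

// pointSlicePool is created locally here; the default capacity of 100 is an
// arbitrary choice for this sketch.
var pointSlicePool = &sync.Pool{
	New: func() interface{} { return make([]Point, 0, 100) },
}

// pointSlicePoolGet returns an empty point slice of at least minCap capacity.
// It asks the pool only once: if the candidate is too small, it is put back
// and a new slice is allocated.
func pointSlicePoolGet(minCap int) []Point {
	candidate := pointSlicePool.Get().([]Point)
	if cap(candidate) >= minCap {
		return candidate[:0]
	}
	pointSlicePool.Put(candidate)
	return make([]Point, 0, minCap)
}

func main() {
	small := pointSlicePoolGet(10)
	large := pointSlicePoolGet(1000)
	fmt.Println(cap(small) >= 10, cap(large) >= 1000)
}
```

Asking the pool a single time keeps the helper cheap; a pool whose slices are mostly too small would simply fall through to a fresh allocation.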

shanson7

Looks good.

I did a more in-depth (WIP) version of this here: master...bloomberg:fetch_resolve_opts

It handles the case of Reqs being functionally identical for fetching. Not sure if it's worth it.

Dieterbe · 2020-07-03T11:23:08Z

> It handles the case of Reqs being functionally identical for fetching. Not sure if it's worth it.

Any chance this can be rebased to reduce the diff?

I think the remaining possible deduplications all come with complications. For example, deduping reqs after the index lookups or after planRequests means going through many more reqs, possibly tens of thousands or more. That's a lot of work for something that only rarely pays off. Deduping the requested targets or the AST (query subexpressions) at least has less data to process.

shanson7

Looks good!

shanson7 · 2020-07-03T13:12:08Z

> any chance this can be rebased to reduce the diff?

Probably with a bit of work, if we want to investigate that route.

> trying to dedup reqs after the index lookups or after planRequests means having to go through many more reqs. possibly tens of thousands or more

True, but I think we are already doing a lot of logic that iterates over the result series. If the lookup is quick, deduping should be fast (especially when compared with fetching data from external sources).

Dieterbe · 2020-07-06T05:22:17Z

> True, but I think we are already doing a lot of logic that is iterating over the result series. If the lookup is quick deduping should be fast (especially when compared with fetching data from external sources).

FWIW, planRequests also already iterates over all reqs (more than once if max-points-per-req is enabled and/or breached), as does meta.RenderStats.PointsFetch = rp.PointsFetch().
There may be some payoff for post-index-lookup deduping (e.g. target=foo.*&target=foo.bar), but I think common subexpressions are more likely and should give more bang for the buck.
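As an illustration of what post-index-lookup deduping means for target=foo.*&target=foo.bar, here is a minimal sketch. The in-memory index, `resolve`, and `uniqueFetches` are hypothetical; the standard library's `path.Match` merely stands in for a real index lookup.

```go
package main

import (
	"fmt"
	"path"
)

// resolve is a hypothetical index lookup: it returns the series names in
// index that match the glob pattern.
func resolve(index []string, pattern string) []string {
	var out []string
	for _, name := range index {
		if ok, _ := path.Match(pattern, name); ok {
			out = append(out, name)
		}
	}
	return out
}

// uniqueFetches resolves every target and returns each matching series once,
// in first-seen order. Note that it has to visit every resolved series,
// which is the cost discussed above.
func uniqueFetches(index, targets []string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, t := range targets {
		for _, name := range resolve(index, t) {
			if _, ok := seen[name]; !ok {
				seen[name] = struct{}{}
				out = append(out, name)
			}
		}
	}
	return out
}

func main() {
	index := []string{"foo.bar", "foo.baz", "other.x"}
	targets := []string{"foo.*", "foo.bar"}
	// foo.bar is matched by both targets but only needs to be fetched once.
	fmt.Println(uniqueFetches(index, targets))
}
```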

shanson7 · 2020-07-06T08:10:48Z

> I think common subexpressions are more likely

Can you give an example?

The one that I see most frequently is something like target=foo&target=sum(foo) for dashboards where the total and individual series are graphed together.

Dieterbe · 2020-07-06T09:12:47Z

> Can you give an example?

You know, I might just be quite off here.
I don't have evidence or empirical data right now to favor the common subexpression case over the overlapping target expressions case. I think one case of overlapping expressions I saw somewhere recently was something like asPercent(foo.bar,sum(foo.*)), but I can't remember a common subexpr case.

PS: in your example, foo would expand to multiple series on both targets, right? So to avoid confusion, it may be better to write it as target=foo.*&target=sum(foo.*) (which could be solved with common subexpression dedup as well as with deduping individual requests).
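To illustrate the common subexpression angle for target=foo.*&target=sum(foo.*), here is a minimal sketch that walks a toy AST and collects each distinct series pattern once. The `expr` type and `collectPatterns` are hypothetical and far simpler than a real query AST.

```go
package main

import "fmt"

// expr is a hypothetical minimal AST node: either a series pattern (leaf,
// fn is empty) or a function call with arguments.
type expr struct {
	fn   string
	pat  string
	args []*expr
}

// collectPatterns walks e and appends every leaf pattern that has not been
// seen before to out. Sharing seen across targets dedups across the query.
func collectPatterns(e *expr, seen map[string]bool, out []string) []string {
	if e.fn == "" {
		if !seen[e.pat] {
			seen[e.pat] = true
			out = append(out, e.pat)
		}
		return out
	}
	for _, a := range e.args {
		out = collectPatterns(a, seen, out)
	}
	return out
}

func main() {
	// target=foo.*&target=sum(foo.*)
	targets := []*expr{
		{pat: "foo.*"},
		{fn: "sum", args: []*expr{{pat: "foo.*"}}},
	}
	seen := make(map[string]bool)
	var patterns []string
	for _, t := range targets {
		patterns = collectPatterns(t, seen, patterns)
	}
	// Both targets need the same input, so one lookup and fetch suffices.
	fmt.Println(patterns)
}
```

This only has to process the handful of nodes in the query, not the potentially large set of resolved series.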

shanson7 · 2020-07-06T09:22:35Z

> I don't have evidence or empirical data right now

Me neither. Especially with tag queries, it becomes difficult to detect the overlapping without instrumenting the code. At the time of writing the dedup series resolve code I didn't think it was worth it since the dedup was very inexpensive and I knew of a couple large use cases that would benefit.

> PS in your example foo would expand to multiple series on both targets right? So to avoid confusion, maybe better write it as target=foo.*&target=sum(foo.*)

I was just using the example from your comment, but yes target=foo.*&target=sum(foo.*) in this case.

> which could be solved with both common subexpression dedup as well as deduping individual requests

But PNGroup would need to be excluded from consideration after planning in the case that it didn't need to adjust the fetch interval.

Dieterbe added 6 commits July 3, 2020 08:12

deduplicate plan.Reqs

8ab1b47

it's trivial to demonstrate that before this change, a query like target=foo&target=foo results in duplicates in plan.Reqs, which leads to duplicates in ReqMap during executePlan().

document render path series (de)duplication and impact on series reuse

97c9b12

ReqMap.Dump() output fix

6dc9056

document series reuse stuff better

7cc4e0a

utility function for the pool to get appropriately sized slice

a3893e7

fix: Identical series causes double consolidation

81d234b

apply COW when altering points slice during runtime normalization fix #1807

Dieterbe requested review from shanson7 and robert-milan July 3, 2020 10:02

Dieterbe and others added 2 commits July 3, 2020 14:24

typo fixes

c7e8a58

Co-authored-by: Sean Hanson <shanson7@bloomberg.net>

only attempt to get from pool once

d870e00

any further attempts would likely yield the same result

shanson7 approved these changes Jul 3, 2020

Dieterbe merged commit 469a47d into master Jul 6, 2020

Dieterbe deleted the fixes-for-duplicate-series branch July 6, 2020 05:17
