This repository was archived by the owner on Aug 23, 2023. It is now read-only.

duplicate series: deduplicate + fix runtime normalization #1855

Merged on Jul 6, 2020 (8 commits)

Conversation

Dieterbe
Contributor

@Dieterbe Dieterbe commented Jul 3, 2020

In this PR, I introduce a fairly trivial deduplication optimization, followed by a good chunk of developer documentation that clarifies our use of request types, where the sources of duplication and reuse are, and where further deduplication is possible. That leads to a clearer, more precise explanation of input data reuse (which all boils down to the dataMap), and finally to the fix for bug 1807.

fix #1807
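
To illustrate the deduplication idea, here is a minimal sketch and not the code from this PR. The `req` type and its fields are hypothetical stand-ins for a fetch request; the only assumption is that two requests are duplicates when all of their fields are equal, so the struct itself can serve as a map key.

```go
package main

import "fmt"

// req is a hypothetical, simplified fetch request. It is comparable,
// so it can be used directly as a map key.
type req struct {
	Query string
	From  uint32
	To    uint32
}

// dedupReqs returns reqs with exact duplicates removed, keeping the
// first occurrence of each so the original order is preserved.
func dedupReqs(reqs []req) []req {
	seen := make(map[req]struct{}, len(reqs))
	out := make([]req, 0, len(reqs))
	for _, r := range reqs {
		if _, ok := seen[r]; ok {
			continue
		}
		seen[r] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	// The same target requested twice yields two identical reqs.
	reqs := []req{{"foo", 0, 60}, {"foo", 0, 60}}
	fmt.Println(len(dedupReqs(reqs))) // prints 1
}
```

Requests that differ in any field (for example the time range) are kept as separate entries in this sketch.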

Dieterbe added 6 commits July 3, 2020 08:12
it's trivial to demonstrate that before this change, a query like
target=foo&target=foo results in duplicates in plan.Reqs, which
leads to duplicates in ReqMap during executePlan().
apply COW when altering points slice during runtime normalization
fix #1807
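
The copy-on-write idea in the commit above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the `point` and `series` types and the `normalizeCOW` helper are hypothetical, consolidation is done by plain averaging, and the target interval is assumed to be a multiple of the input interval. The point is that the input slice is never modified, so a series that is referenced twice is not consolidated twice.

```go
package main

import "fmt"

// point and series are hypothetical, simplified types.
type point struct {
	Val float64
	Ts  uint32
}

type series struct {
	Interval uint32
	Points   []point
}

// normalizeCOW consolidates in to the coarser interval by averaging.
// It writes into a freshly allocated slice and leaves in.Points
// untouched, so other holders of the same slice still see the raw data.
func normalizeCOW(in series, interval uint32) series {
	if in.Interval == 0 || interval <= in.Interval || interval%in.Interval != 0 {
		return in // nothing to do, or unsupported in this sketch
	}
	n := int(interval / in.Interval)
	out := series{
		Interval: interval,
		Points:   make([]point, 0, (len(in.Points)+n-1)/n),
	}
	for i := 0; i < len(in.Points); i += n {
		end := i + n
		if end > len(in.Points) {
			end = len(in.Points)
		}
		sum := 0.0
		for _, p := range in.Points[i:end] {
			sum += p.Val
		}
		out.Points = append(out.Points, point{
			Val: sum / float64(end-i),
			Ts:  in.Points[end-1].Ts,
		})
	}
	return out
}

func main() {
	shared := series{Interval: 10, Points: []point{{1, 10}, {3, 20}, {5, 30}, {7, 40}}}
	a := normalizeCOW(shared, 20)
	b := normalizeCOW(shared, 20) // second user of the same input
	fmt.Println(len(shared.Points), len(a.Points), len(b.Points))
}
```

An in-place variant would shrink the shared points on the first call, and the second call would then consolidate already consolidated data.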
@Dieterbe Dieterbe requested review from shanson7 and robert-milan July 3, 2020 10:02
@@ -12,3 +16,17 @@ var pointSlicePool *sync.Pool
func Pool(p *sync.Pool) {
	pointSlicePool = p
}

// pointSlicePoolGet returns a pointslice of at least minCap capacity.
// similar code lives also in api.Fix(). at some point we should really clean up our pool code.
Contributor Author

note that the pool is created in the api package and passed into the expr package.
the original idea was to make the expr library reusable in different software, hence the need to pass the pool into it.

But it's starting to make more sense to just have one global pool singleton with a couple methods and directly access that everywhere. Out of scope for this PR though
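
As a rough sketch of what such a global pool singleton with a couple of methods could look like (an assumption for illustration, not existing code in this repository): `point`, `getPoints`, `putPoints` and the default capacity of 64 are all hypothetical names and values.

```go
package main

import (
	"fmt"
	"sync"
)

// point is a hypothetical, simplified datapoint type.
type point struct {
	Val float64
	Ts  uint32
}

// pointSlicePool is a package-level singleton; callers never need to
// have a pool passed in.
var pointSlicePool = sync.Pool{
	New: func() interface{} { return make([]point, 0, 64) },
}

// getPoints returns an empty point slice with at least minCap capacity.
// If the pooled candidate is too small it is put back and a new slice
// is allocated instead.
func getPoints(minCap int) []point {
	candidate := pointSlicePool.Get().([]point)
	if cap(candidate) >= minCap {
		return candidate[:0]
	}
	pointSlicePool.Put(candidate)
	return make([]point, 0, minCap)
}

// putPoints hands a slice back to the pool for reuse.
func putPoints(s []point) {
	pointSlicePool.Put(s[:0])
}

func main() {
	s := getPoints(1000)
	fmt.Println(len(s), cap(s) >= 1000) // prints 0 true
	putPoints(s)
}
```

Storing slices by value in a `sync.Pool` causes a small allocation on each `Put`; pooling pointers to slices avoids that at the cost of a slightly clumsier API.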

Collaborator

@shanson7 shanson7 left a comment

Looks good.

I did a more in-depth (WIP) version of this here: master...bloomberg:fetch_resolve_opts

It handles the case of Reqs being functionally identical for fetching. Not sure if it's worth it.

@Dieterbe
Contributor Author

Dieterbe commented Jul 3, 2020

It handles the case of Reqs being functionally identical for fetching. Not sure if it's worth it.

any chance this can be rebased to reduce the diff?

I think the remaining possible deduplications all come with complications. e.g. trying to dedup reqs after the index lookups or after planRequests means having to go through many more reqs. possibly tens of thousands or more. lots of work for something that only rarely pays off. deduping the requested targets or on the AST (query subexpressions) has to process less data, at least.

Dieterbe and others added 2 commits July 3, 2020 14:24
Co-authored-by: Sean Hanson <shanson7@bloomberg.net>
any further attempts would likely yield the same result
Collaborator

@shanson7 shanson7 left a comment

Looks good!

@shanson7
Collaborator

shanson7 commented Jul 3, 2020

any chance this can be rebased to reduce the diff?

Probably with a bit of work, if we want to investigate that route.

trying to dedup reqs after the index lookups or after planRequests means having to go through many more reqs. possibly tens of thousands or more

True, but I think we are already doing a lot of logic that is iterating over the result series. If the lookup is quick deduping should be fast (especially when compared with fetching data from external sources).

@Dieterbe Dieterbe merged commit 469a47d into master Jul 6, 2020
@Dieterbe Dieterbe deleted the fixes-for-duplicate-series branch July 6, 2020 05:17
@Dieterbe
Contributor Author

Dieterbe commented Jul 6, 2020

True, but I think we are already doing a lot of logic that is iterating over the result series. If the lookup is quick deduping should be fast (especially when compared with fetching data from external sources).

FWIW, in planRequests we also already iterate over all reqs (more than once if max-points-per-req is enabled and/or breached), and again for meta.RenderStats.PointsFetch = rp.PointsFetch().
There may be some payoff for post-index-lookup deduping (e.g. target=foo.*&target=foo.bar) but I think common subexpressions are more likely and should be higher bang for buck.
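
For the post-index-lookup case, a minimal sketch of why target=foo.*&target=foo.bar overlaps only after resolution. Everything here is hypothetical: the in-memory `index`, and `path.Match` as a stand-in for the real index lookup (its glob semantics differ from Graphite's, which is fine for this example).

```go
package main

import (
	"fmt"
	"path"
)

// resolve is a stand-in for an index lookup: it returns every series
// name in index that matches pattern.
func resolve(index []string, pattern string) []string {
	var out []string
	for _, name := range index {
		if ok, err := path.Match(pattern, name); err == nil && ok {
			out = append(out, name)
		}
	}
	return out
}

// resolveUnique resolves all patterns and returns each series once,
// plus the total number of series seen before deduplication.
func resolveUnique(index, patterns []string) (unique []string, total int) {
	seen := make(map[string]struct{})
	for _, p := range patterns {
		for _, name := range resolve(index, p) {
			total++
			if _, ok := seen[name]; ok {
				continue
			}
			seen[name] = struct{}{}
			unique = append(unique, name)
		}
	}
	return unique, total
}

func main() {
	index := []string{"foo.bar", "foo.baz", "other.x"}
	unique, total := resolveUnique(index, []string{"foo.*", "foo.bar"})
	fmt.Println(unique, total) // prints [foo.bar foo.baz] 3
}
```

The two patterns are textually different, so comparing targets cannot catch the overlap; it only shows up once every resolved series has been visited, which is the cost discussed above.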

@shanson7
Collaborator

shanson7 commented Jul 6, 2020

I think common subexpressions are more likely

Can you give an example?

The one that I see most frequently is something like target=foo&target=sum(foo) for dashboards where the total and individual series are graphed together.

@Dieterbe
Contributor Author

Dieterbe commented Jul 6, 2020

Can you give an example?

You know I might just be quite off here.
I don't have evidence or empirical data right now to favor the common subexpression case over the overlapping target expressions case. I think one case of overlapping expressions I saw somewhere recently was something like asPercent(foo.bar,sum(foo.*)), but I can't remember a common subexpr case.

PS in your example foo would expand to multiple series on both targets right? So to avoid confusion, maybe better write it as
target=foo.*&target=sum(foo.*) (which could be solved with both common subexpression dedup as well as deduping individual requests)
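
A minimal sketch of deduplicating at the AST level for a query like this one: walk every target expression and collect each distinct series pattern once. The `expr` type and helpers are hypothetical, and the sketch assumes a pattern alone identifies what needs to be fetched, whereas a real request would also depend on things like the time range and consolidation settings.

```go
package main

import "fmt"

// expr is a hypothetical, minimal query AST node: either a function
// call with arguments (fn set) or a leaf holding a series pattern.
type expr struct {
	fn   string
	pat  string
	args []*expr
}

func leaf(pat string) *expr { return &expr{pat: pat} }

func call(fn string, args ...*expr) *expr { return &expr{fn: fn, args: args} }

// uniquePatterns walks all targets and returns each distinct leaf
// pattern once, in first-seen order.
func uniquePatterns(targets []*expr) []string {
	seen := make(map[string]struct{})
	var out []string
	var walk func(e *expr)
	walk = func(e *expr) {
		if e.fn == "" {
			if _, ok := seen[e.pat]; !ok {
				seen[e.pat] = struct{}{}
				out = append(out, e.pat)
			}
			return
		}
		for _, a := range e.args {
			walk(a)
		}
	}
	for _, t := range targets {
		walk(t)
	}
	return out
}

func main() {
	// target=foo.*&target=sum(foo.*)
	targets := []*expr{leaf("foo.*"), call("sum", leaf("foo.*"))}
	fmt.Println(uniquePatterns(targets)) // prints [foo.*]
}
```

This only has to process the handful of nodes in the query, not the potentially large number of series they resolve to.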

@shanson7
Collaborator

shanson7 commented Jul 6, 2020

I don't have evidence or empirical data right now

Me neither. Especially with tag queries, it becomes difficult to detect the overlapping without instrumenting the code. At the time of writing the dedup series resolve code I didn't think it was worth it since the dedup was very inexpensive and I knew of a couple large use cases that would benefit.

PS in your example foo would expand to multiple series on both targets right? So to avoid confusion, maybe better write it as target=foo.*&target=sum(foo.*)

I was just using the example from your comment, but yes target=foo.*&target=sum(foo.*) in this case.

which could be solved with both common subexpression dedup as well as deduping individual requests

But PNGroup would need to be excluded from consideration after planning in the case that it didn't need to adjust the fetch interval.

Development

Successfully merging this pull request may close these issues.

Identical series causes double consolidation
2 participants