This repository was archived by the owner on Aug 23, 2023. It is now read-only.

alignRequests is too coarse grained (for timeshift functions etc) #926

Closed
Dieterbe opened this issue May 29, 2018 · 9 comments


Dieterbe commented May 29, 2018

alignRequests aligns the step across all series.

However, Graphite does not behave that way:
dieter 6:58 PM
@dcech can you confirm that in graphite you can render?target=A&target=B and get a response with A and B in different steps (whether A is timeshifted or just a metric from a different retention policy)

dcech 6:58 PM
yes, you can
6:59 PM
graphite doesn't do anything to align steps unless you use a query to aggregate series together
6:59 PM
you can do target=a* and get back series with different steps if they have different retentions etc

@Dieterbe

It seems like plan.Reqs should bundle reqs together based on whether they get used together,
and then call alignRequests for each bundle.

@Dieterbe

related: #463


Dieterbe commented May 30, 2018

For the record, I can confirm this behavior in graphite, using this storage-schemas.conf:

[aa]
pattern = aa
retentions = 5s:1d

[default]
pattern = .*
retentions = 1s:1d

http://localhost/render/?target=a*&from=-1min&height=800&width=1500
(screenshot of the rendered graph attached)

@Dieterbe

Pulling in some relevant quotes from the Slack conversation:

shanson 2:59 PM
So, it seems like there are multiple ways this could play out still.
1. Don't do any alignment. Functions that need to combine series will need to call some normalize helper (this is what graphite does)
2. Align per target (just in case they get aggregated together)
3. More in-depth analysis during plan to figure out if they will later be aggregated and group them together for alignment

dieter 4:19 PM
somewhat related: the logic that was switched around (reading lower res data from storage if we know we're gonna hit maxdatapoints) in #463 can be re-introduced for targets that don't do any processing (or do minimal processing like just a sumSeries)

dieter 4:22 AM
2 is a no-go because it's not the graphite way.
so it's really 1 vs 3. I still like the up-front alignment step because it allows optimizations like fetching from lower-res archives, provided we are 100% sure we're not doing any processing on top of the data (otherwise we run into #463 again)

shanson 9:53 AM
True. Although things like groupByTag would be quite hard to support up front
9:54 AM
But we could always flag any call to an aggregate func as optimizable / alignable

dieter 11:00 AM
but you can have nested function calls involving aggregates and others (say, summarize). so I think functions should be able to mark themselves as not optimizable (such as summarize), and also if the request comes from a graphite server, it should be treated as not-optimizable
11:03 AM
groupByTags would also mark as non-optimizable
11:03 AM
anyway, we can make this optimization later, more important is to make alignRequests work properly

shanson 11:08 AM
right
11:09 AM
I imagine it would be something in the context, innermost functions winning

dieter 11:13 AM
hmm we're on different pages. sum(summarize(...)) or summarize(sum(...)), neither of these should be optimized, but just sum() can be

shanson 11:19 AM
why?
11:20 AM
after sum is evaluated, they will have already been normalized

dieter 11:23 AM
ah you're right

shanson 12:06 PM
I am a bit worried about the performance implication of starting with (1) and working toward (3) in time
12:07 PM
I’m wondering if that’s why we see performance improvements by not going to graphite
12:07 PM

if request.NoProxy {
12:09 PM
Hmm, seems that’s unrelated

dieter 12:14 PM
my plan is to shoot for 3 directly

@Dieterbe

According to shanson, timeShift, offset and moving* depend on this fix.

@Dieterbe Dieterbe changed the title alignRequests is too coarse grained alignRequests is too coarse grained (for timeshift functions etc) Oct 7, 2019
@Dieterbe Dieterbe modified the milestones: vnext, sprint-2 Oct 7, 2019
@fkaleo fkaleo modified the milestones: sprint-2, sprint-4 Oct 28, 2019

Dieterbe commented Nov 28, 2019

This is my current thinking. Let me introduce a bunch of new terms first....
@shanson7 curious to hear your thoughts on this.

Greedy-resolution functions

A Greedy-resolution function (GR-function) is a processing function that requires, or may require, high-resolution input data to do its computation, even if its output will be consolidated down afterwards (due to the maxDataPoints setting).
For example summarize().
For these, we should fetch data at as high a resolution as we can. (see #463)
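
To make this concrete, here is a small self-contained Go sketch (illustrative only, not metrictank code): running a max-summarize over data that was already consolidated down (averaged, as maxDataPoints-consolidation would do) flattens the spike that the raw data would have reported.

```go
package main

import "fmt"

// summarizeMax takes the max over buckets of `per` consecutive points,
// loosely mimicking summarize(..., "max").
func summarizeMax(points []float64, per int) []float64 {
	var out []float64
	for i := 0; i < len(points); i += per {
		end := i + per
		if end > len(points) {
			end = len(points)
		}
		max := points[i]
		for _, p := range points[i:end] {
			if p > max {
				max = p
			}
		}
		out = append(out, max)
	}
	return out
}

// consolidateAvg averages buckets of `per` consecutive points, like
// maxDataPoints-based consolidation would.
func consolidateAvg(points []float64, per int) []float64 {
	var out []float64
	for i := 0; i < len(points); i += per {
		end := i + per
		if end > len(points) {
			end = len(points)
		}
		sum := 0.0
		for _, p := range points[i:end] {
			sum += p
		}
		out = append(out, sum/float64(end-i))
	}
	return out
}

func main() {
	raw := []float64{1, 1, 100, 1, 1, 1, 1, 1}
	fmt.Println(summarizeMax(raw, 4))                    // [100 1]  - spike preserved
	fmt.Println(summarizeMax(consolidateAvg(raw, 2), 2)) // [50.5 1] - spike flattened
}
```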

MDP-optimizable

An MDP-optimizable (maxDataPoints-optimizable) request is a data request for which we can safely fetch lower-precision data, by taking into account the MaxDataPoints-based consolidation that will take place after function processing.
A data request is MDP-optimizable if we know for sure that it won't be subjected to GR-functions.
In other words, when both of these conditions are true:

  • the client was an end-user, not Graphite (Graphite may run any processing, such as GR-functions, without telling us)
  • we (metrictank) will not run GR-functions on this data

What kind of optimizations can we do? Consider this retention rule:

1s:1d,10s:1y
request from=now-2hours to=now, MDP=800
Our options are:

  • 7200 raw (archive 0) datapoints, consolidate aggNum 9, down to 800 (by the way, current code does generate "odd" intervals like 9s in this case)
  • 720 datapoints of archive 1.

While archive 1 is a bit less accurate, it means less data to load and decode, and requires no runtime consolidation. We strongly suspect that using this data to satisfy the request is less costly.
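
For illustration, a minimal Go sketch of this arithmetic (the names and types are made up for the example, this is not the actual planner code): scan archives from finest to coarsest and take the first one whose natural point count already fits within maxDataPoints; otherwise fall back to the coarsest archive plus consolidation.

```go
package main

import "fmt"

// archive describes one retention archive by its point interval in seconds.
type archive struct {
	interval uint32
}

// pickArchive returns the index of the first archive (finest to coarsest) whose
// natural point count for the request fits within mdp, along with the
// consolidation factor (aggNum) still needed. If none fits, it falls back to
// the coarsest archive and consolidates with aggNum = ceil(points/mdp).
func pickArchive(archives []archive, from, to, mdp uint32) (int, uint32) {
	for i, a := range archives {
		points := (to - from) / a.interval
		if points <= mdp {
			return i, 1
		}
	}
	last := len(archives) - 1
	points := (to - from) / archives[last].interval
	return last, (points + mdp - 1) / mdp
}

func main() {
	// retentions 1s:1d,10s:1y ; a 2 hour request with maxDataPoints=800
	archives := []archive{{interval: 1}, {interval: 10}}
	idx, aggNum := pickArchive(archives, 0, 7200, 800)
	fmt.Println(idx, aggNum) // 1 1 -> fetch the 10s rollup (720 points), no consolidation needed
	// without MDP-optimization we would fetch archive 0: 7200 points, aggNum ceil(7200/800) = 9
}
```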

This is a more refined solution to #463.
In the past, we MDP-optimized every request, which led to incorrect data when fed into GR-functions.
We corrected that by turning off all MDP-optimizations, which I think led to increased latencies, though we don't have the stats off-hand.
The hope is that by re-introducing MDP-optimizations the correct way, we can speed up many requests again.

Interval-altering function

Certain functions will return output series in an interval different from the input interval.
For example summarize() and smartSummarize(). We refer to these as IA-functions below.
In principle we can predict what the output interval will be during the plan phase, because we can parse the function arguments.
However, for simplicity, we don't implement this.
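
Even though we don't implement it, predicting the output interval at plan time would amount to parsing the bucket argument. A hypothetical sketch (names assumed for illustration):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// predictSummarizeInterval parses a Graphite-style bucket size such as
// "30s", "5min" or "1h" into the output interval in seconds that a
// summarize() call with that bucket would produce.
func predictSummarizeInterval(bucket string) (uint32, error) {
	units := []struct {
		suffix string
		secs   uint32
	}{{"min", 60}, {"h", 3600}, {"d", 86400}, {"s", 1}}
	for _, u := range units {
		if strings.HasSuffix(bucket, u.suffix) {
			n, err := strconv.Atoi(strings.TrimSuffix(bucket, u.suffix))
			if err != nil {
				return 0, err
			}
			return uint32(n) * u.secs, nil
		}
	}
	return 0, fmt.Errorf("unparseable bucket %q", bucket)
}

func main() {
	iv, _ := predictSummarizeInterval("5min")
	fmt.Println(iv) // 300: summarize(..., "5min") emits one point per 300s
}
```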

Transparent aggregation

A trans-aggregation is a processing function that aggregates multiple series together in a predictable way (known at planning time, before fetching the data).
E.g. sumSeries, averageSeries are known to always aggregate all their inputs together.

Opaque aggregation

An opaque-aggregation is a processing function where we cannot accurately predict which series will be aggregated together
because it depends on information (e.g. names, tags) that will only be known at runtime. (e.g. groupByTags, groupByNode(s))

Pre-normalizable

When series will be used together (e.g. aggregated together), they need to have the same resolution.
(Note that in general, series do not need to have the same resolution. We have been aligning resolutions much too aggressively; see #926.)
An aggregation can be opaque or transparent, as defined above.

Pre-normalizing is when we can safely - during planning - set up normalization to happen right after fetching (or better: set up the fetch parameters such that normalizing is not needed).
This is the case when series go straight from fetching into a transparent aggregation, possibly with some processing functions in between - but no opaque aggregation(s) or IA-function(s).

For example if we have these schemas:

series A: 1s:1d,10s:1y
series B: 10s:1d

Let's say the initial fetch parameters are to get the raw data for both A and B.
If we know that these series will be aggregated together, they will need to be normalized, meaning A will need to be at 10s resolution.
If the query is sum(A,B) or sum(perSecond(A),B) we can safely pre-normalize: we can fetch the first rollup of series A rather than fetching the raw data and normalizing (consolidating) it at runtime - and thus spend fewer resources - because we know for sure that having the coarser data for A will not cause trouble in this pipeline.
However, if the query is sum(A, summarize(B,...)) we cannot safely do this as we don't have a prediction of what the output interval of summarize(B,...) will be.
Likewise, if the query is groupByNode(group(A,B), 2, callback='sum') we cannot predict whether A and B will end up in the same group, and thus should be normalized.
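
A minimal sketch of the sum(A,B) case above, using made-up types (not the real planner): for a group of requests we know will be transparently aggregated, bump each member to the coarsest interval any member needs, so the data arrives from storage already normalized.

```go
package main

import "fmt"

type req struct {
	key      string
	archives []uint32 // available archive intervals in seconds, finest first
	archIdx  int      // currently chosen archive
}

// preNormalize bumps each request in a transparently-aggregated group to the
// coarsest interval that any member of the group must use anyway.
func preNormalize(group []*req) {
	var target uint32
	for _, r := range group {
		if iv := r.archives[r.archIdx]; iv > target {
			target = iv
		}
	}
	for _, r := range group {
		// pick the first archive whose interval is >= target (a real
		// implementation would also check that intervals divide evenly).
		for i, iv := range r.archives {
			if iv >= target {
				r.archIdx = i
				break
			}
		}
	}
}

func main() {
	a := &req{key: "A", archives: []uint32{1, 10}} // 1s:1d,10s:1y
	b := &req{key: "B", archives: []uint32{10}}    // 10s:1d
	preNormalize([]*req{a, b})
	// 10 10: fetch A's rollup instead of raw data; no runtime normalization needed
	fmt.Println(a.archives[a.archIdx], b.archives[b.archIdx])
}
```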

Proposed changes

  1. Don't align any requests (as in models.Req) up front
  2. Make sure all aggregation functions can normalize at runtime (if necessary), otherwise they will fail to process multiple input series that now can have different intervals
  3. Implement pre-normalization
  4. While we're at it, may as well implement MDP-optimization
  5. (planning-stage awareness of the output interval of IA-functions, which means we can set up fetching/(pre)normalization in a smarter way)

Steps 1 and 2 will solve our most urgent problem of over-aligning data (#926).
However, they will probably (?) leave some performance optimizations on the table, which steps 3 and 4 address. It's unclear how urgent steps 3 and 4 are, though they aren't too difficult to implement.
Implementing both of them can probably be done in one shot, as solving them is done in a similar way.
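
For step 2, a minimal sketch (simplified types, average-only consolidation, not the actual function API) of what normalizing at runtime inside an aggregation function could look like: bring all inputs to the least common multiple of their intervals before combining them.

```go
package main

import "fmt"

type series struct {
	interval uint32
	values   []float64
}

func gcd(a, b uint32) uint32 {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

func lcm(a, b uint32) uint32 { return a / gcd(a, b) * b }

// normalize consolidates (here: averages) each series up to the least common
// multiple of all input intervals, so they can be combined point by point.
func normalize(in []series) []series {
	common := in[0].interval
	for _, s := range in[1:] {
		common = lcm(common, s.interval)
	}
	out := make([]series, len(in))
	for i, s := range in {
		factor := int(common / s.interval)
		var vals []float64
		for j := 0; j < len(s.values); j += factor {
			end := j + factor
			if end > len(s.values) {
				end = len(s.values)
			}
			sum := 0.0
			for _, v := range s.values[j:end] {
				sum += v
			}
			vals = append(vals, sum/float64(end-j))
		}
		out[i] = series{interval: common, values: vals}
	}
	return out
}

func main() {
	a := series{interval: 10, values: []float64{1, 2, 3, 4, 5, 6}} // 60s of data at 10s
	b := series{interval: 30, values: []float64{10, 20}}           // 60s of data at 30s
	for _, s := range normalize([]series{a, b}) {
		fmt.Println(s.interval, s.values)
	}
	// 30 [2 5]
	// 30 [10 20]
}
```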

Note:

  1. Step 1 has a complication:
    since we no longer set up all models.Req for a request at once, it's trickier to implement "max-points-per-req-soft" and "max-points-per-req-hard".
    Think of it this way: if the number of datapoints fetched in a request is higher than the soft limit, which series should we fetch in a coarser way? Previously, all targets were aligned to the same interval and would be bumped to coarser resolutions all together. Now, we have different groups of targets at possibly different resolutions, each of which could independently be made coarser to accommodate "max-points-per-req-soft".
    We can implement a heuristic that tries to pick series that:

    • are highest resolution compared to other series in the request - trying to avoid series that will be processed by a GR-function, though if we have no other choice, we will coarsen those too, as we currently do.
    • have normalization applied to them. Bumping those to the next rollup could satisfy the max-points-per-req constraint and also remove the inefficiency of fetching higher-resolution data only to consolidate it down.
      We keep making fetches coarser until either of these happens:
    • we are fetching the coarsest archive of every series
    • we are within the max-points-per-req-soft constraint.
      Then we compare against the hard limit and bail out if it is breached, similar to how we currently do it. (A rough sketch of this heuristic follows after this list.)
      Note that to keep supporting the max-points-per-req-soft/max-points-per-req-hard settings we have to implement tracking of GR-functions, which means we can probably just do step 4 while we're at it.
  2. Step 5 is something that can wait.

  3. Our current code doesn't take IA-functions into account at all, as it sets up all series at the same interval before feeding them into the processing pipeline. This probably leads to bugs.
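
As referenced in note 1, a rough sketch (hypothetical types, just to pin down the idea, not a spec) of that coarsening heuristic: keep bumping the most attractive candidate group to its next rollup until the soft limit is met or every group is at its coarsest archive, then enforce the hard limit.

```go
package main

import (
	"errors"
	"fmt"
)

// group is one set of targets that currently share fetch parameters.
type group struct {
	name       string
	intervals  []uint32 // available archive intervals in seconds, finest first
	idx        int      // currently chosen archive
	span       uint32   // seconds covered by the request
	feedsGR    bool     // output feeds a GR-function: coarsen these last
	normalized bool     // already being (pre)normalized: coarsen these first
}

func (g *group) points() uint32   { return g.span / g.intervals[g.idx] }
func (g *group) canCoarsen() bool { return g.idx < len(g.intervals)-1 }

func total(groups []*group) uint32 {
	var t uint32
	for _, g := range groups {
		t += g.points()
	}
	return t
}

// better reports whether g is a more attractive coarsening candidate than cur:
// prefer groups not feeding GR-functions, then groups already being normalized,
// then whichever currently fetches the most points.
func better(g, cur *group) bool {
	if g.feedsGR != cur.feedsGR {
		return !g.feedsGR
	}
	if g.normalized != cur.normalized {
		return g.normalized
	}
	return g.points() > cur.points()
}

// coarsen bumps candidates until we fit under the soft limit or run out of
// coarser archives, then checks the hard limit.
func coarsen(groups []*group, soft, hard uint32) error {
	for total(groups) > soft {
		var pick *group
		for _, g := range groups {
			if g.canCoarsen() && (pick == nil || better(g, pick)) {
				pick = g
			}
		}
		if pick == nil {
			break // everything is already at its coarsest archive
		}
		pick.idx++
	}
	if total(groups) > hard {
		return errors.New("request exceeds max-points-per-req-hard")
	}
	return nil
}

func main() {
	groups := []*group{
		{name: "sum(a,b)", intervals: []uint32{1, 10, 60}, span: 86400, normalized: true},
		{name: "summarize(c, ...)", intervals: []uint32{10, 60}, span: 86400, feedsGR: true},
	}
	err := coarsen(groups, 20000, 1000000)
	// 10 10 <nil>: only the normalized, non-GR group got coarsened
	fmt.Println(groups[0].intervals[groups[0].idx], groups[1].intervals[groups[1].idx], err)
}
```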


Dieterbe commented Dec 1, 2019

An interesting addendum to pre-normalization:
first, a PNC group = a pre-normalization candidate group.
We need to identify which requests meet the pre-normalization criteria and will be transparently aggregated together.
This brings up the interesting question: what if the same series are used multiple times?
e.g. target=foo&target=sum(foo,bar)
foo and bar form a PNC group, but if we pre-normalize foo (fetch lower resolution data), we should not impact target=foo.
Another example:
target=sum(a,b)&target=avg(b,c)
if a and c have different resolutions, things should work as expected (a should not affect the 2nd pre-normalization and vice versa for c).

Both of these examples may result in fetching different archives for basically the same request (i.e. the same series and to/from).
This could be a deficiency, or it may be more optimal; it depends on what's active in the chunk cache
(it is better to runtime-normalize data from the cache than to cold-fetch data, even if the latter is at the desired interval already). I think for now, if we implement pre-normalization, we should mainly consider correctness and not worry about performance.
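
A tiny sketch of one way to handle this (hypothetical types, not necessarily how it ends up implemented): include a PNGroup identifier in the request key, so the same series requested in different contexts stays a distinct request and only members of the same PNC group get pre-normalized together.

```go
package main

import "fmt"

// PNGroup identifies one pre-normalization candidate group; 0 means "not in any group".
type PNGroup uint64

// reqKey is what fetch requests could be keyed on: the same series used both on
// its own and inside a transparent aggregation yields two distinct requests.
type reqKey struct {
	series  string
	from    uint32
	to      uint32
	pnGroup PNGroup
}

func main() {
	// target=foo&target=sum(foo,bar)
	reqs := map[reqKey]string{
		{series: "foo", from: 0, to: 3600, pnGroup: 0}: "plain target=foo: fetch as-is",
		{series: "foo", from: 0, to: 3600, pnGroup: 1}: "foo inside sum(foo,bar): may be pre-normalized",
		{series: "bar", from: 0, to: 3600, pnGroup: 1}: "bar inside sum(foo,bar): may be pre-normalized",
	}
	// 3 distinct requests: pre-normalizing the PNGroup 1 members cannot affect plain target=foo
	fmt.Println(len(reqs))
}
```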

@Dieterbe

Something else I realized: a side benefit is that inaccuracies due to honoring max-points-per-req are reduced, because we can coarsen PNGroups and MDP-optimizable requests first.

@replay replay modified the milestones: sprint-6, sprint-7 Jan 27, 2020

Dieterbe commented Feb 6, 2020

fixed by #951

@Dieterbe Dieterbe closed this as completed Feb 6, 2020