This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Duplicate Put in point slice pool from mergeSeries #1875

Closed
shanson7 opened this issue Aug 10, 2020 · 2 comments · Fixed by #1879
@shanson7

Describe the bug

In certain scenarios during a render request, a slice can be returned to the pool twice. We saw corrupted responses because of this, so we deployed a change that (within a single render) detects this case, skips the duplicate Put, and logs additional information. With this info I think I've found where it happens: in the (likely rare) case where identical time-series exist at 2 different intervals, mergeSeries will merge them together to form a single normalized value.

mergeSeries then puts all the other series back into the pool and keeps only the first one. However, normalize has already put the new normalized series into the datamap. So, in the scenario where series foo.bar exists with intervals 60 (let's call this foo60) and 30 (foo30), in that order, the process flow goes something like:

  1. mergeSeries calls normalize(datamap, [foo60, foo30])
  2. normalize keeps foo60 (nothing to do)
  3. normalize copies foo30, consolidates it to 60 seconds and adds it to the datamap (let's call this foo60_2)
    • Note: At this point we dropped the slice from foo30 to be GC'd. It did not go into the pool
  4. mergeSeries then iterates over the points in foo60 checking if it can fill NaN values from foo60_2
  5. Finally, mergeSeries adds foo60_2.Datapoints into the pool, even though it is already in the datamap!
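The steps above can be reduced to a minimal Go sketch. This uses a toy free-list rather than metrictank's actual pool (which is built on sync.Pool and not deterministic), so the corruption from the duplicate Put in step 5 is reproducible: two later Gets hand out the same backing array.

```go
package main

import "fmt"

// pointPool is a toy free-list standing in for metrictank's point
// slice pool; unlike sync.Pool it is deterministic, which makes the
// corruption reproducible.
type pointPool struct{ free [][]float64 }

func (p *pointPool) Get() []float64 {
	if n := len(p.free); n > 0 {
		s := p.free[n-1]
		p.free = p.free[:n-1]
		return s[:0]
	}
	return make([]float64, 0, 8)
}

func (p *pointPool) Put(s []float64) { p.free = append(p.free, s) }

// demoDoublePut models step 5: the merged slice is Put directly by
// mergeSeries AND reclaimed again via the datamap (second Put).
// It returns the first point of two supposedly independent series.
func demoDoublePut() (float64, float64) {
	var p pointPool
	merged := p.Get()
	merged = append(merged, 1, 2, 3)

	p.Put(merged) // mergeSeries puts it back ...
	p.Put(merged) // ... and the datamap cleanup puts it back again

	a := p.Get() // both Gets now hand out the same backing array
	b := p.Get()
	a = append(a, 10)
	b = append(b, 99) // silently overwrites a[0]
	return a[0], b[0]
}

func main() {
	a, b := demoDoublePut()
	fmt.Println(a, b) // 99 99: the first series' data was clobbered
}
```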

Helpful Information
Metrictank Version: v1.0-52 (custom build off of commit e085017)
Golang Version: go1.12.7

shanson7 added the bug label Aug 10, 2020

shanson7 commented Aug 10, 2020

For completeness, the reverse case ([foo30, foo60]) is also an issue:

  1. mergeSeries calls normalize(datamap, [foo30, foo60])
  2. normalize keeps foo60 (nothing to do)
  3. normalize copies foo30, consolidates it to 60 seconds and adds it to the datamap (let's call this foo60_2)
    • Note: At this point we dropped the slice from foo30 to be GC'd. It did not go into the pool
  4. mergeSeries then iterates over the points in foo60_2 checking if it can fill NaN values from foo60
  5. mergeSeries adds foo60.Datapoints into the pool (this is ok)
  6. executePlan adds foo60_2 to the datamap for a second time.

This reverse case is the one that the code I added to our rollout actually detects. The case in the original description won't be detected, and the pool will stay corrupted until the next clean or until it is detected.
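A guard of the kind described above could be sketched as follows (names like `renderGuard` are hypothetical and this is not the actual rollout patch): identify each slice by the address of its backing array, and skip, but count, any repeat Put within a single render.

```go
package main

import "fmt"

// simplePool is the same toy free-list stand-in as before.
type simplePool struct{ free [][]float64 }

func (p *simplePool) Get() []float64 {
	if n := len(p.free); n > 0 {
		s := p.free[n-1]
		p.free = p.free[:n-1]
		return s[:0]
	}
	return make([]float64, 0, 8)
}

func (p *simplePool) Put(s []float64) { p.free = append(p.free, s) }

// renderGuard is a hypothetical per-render wrapper: it identifies each
// slice by its backing array and drops (and counts) any repeat Put,
// instead of corrupting the shared pool.
type renderGuard struct {
	inner      *simplePool
	seen       map[*float64]struct{}
	duplicates int
}

func newRenderGuard(p *simplePool) *renderGuard {
	return &renderGuard{inner: p, seen: make(map[*float64]struct{})}
}

func (g *renderGuard) Put(s []float64) {
	if cap(s) == 0 {
		return // nothing poolable
	}
	key := &s[:cap(s)][0] // backing array identity survives reslicing
	if _, dup := g.seen[key]; dup {
		g.duplicates++ // here the real patch would also log details
		return
	}
	g.seen[key] = struct{}{}
	g.inner.Put(s)
}

// demoGuard returns (duplicate Puts detected, slices actually pooled).
func demoGuard() (int, int) {
	base := &simplePool{}
	g := newRenderGuard(base)
	s := base.Get()
	s = append(s, 1, 2)
	g.Put(s)
	g.Put(s) // the duplicate Put from mergeSeries is caught here
	return g.duplicates, len(base.free)
}

func main() {
	dups, pooled := demoGuard()
	fmt.Println(dups, pooled) // 1 1
}
```

Note this only protects within one render; the datamap-side double Put from the original description needs the ownership fix discussed below, since both Puts there come from cleanup paths the guard never sees together.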


Dieterbe commented Aug 11, 2020

Very nice catch!

The expr.Normalize functions were originally meant to be used within the expr package, inside of a plan.Run() call. That means they should follow this regimen:

  • any unused series should be left alone (may be referenced or read from later, e.g. different function processing chain)
  • any newly created series requires an entry in datamap, to make sure it'll be reclaimed later (after plan.Run, because the series might need to be included in the response body)
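The regimen above can be sketched like this (illustrative names, not metrictank's actual API): the input series is left alone, and the newly allocated copy is registered in the datamap so that a single post-Run cleanup pass reclaims it.

```go
package main

import "fmt"

type series struct {
	target string
	points []float64
}

// normalizeCopy leaves the input series alone (it may still be read by
// a different function processing chain) and registers the freshly
// allocated copy in the datamap, which one post-Run cleanup drains
// into the pool.
func normalizeCopy(datamap map[string]series, in series) series {
	out := series{
		target: in.target,
		points: append([]float64(nil), in.points...), // new allocation
	}
	datamap[out.target] = out // datamap now owns out.points
	return out
}

func main() {
	datamap := map[string]series{}
	foo30 := series{target: "foo.bar", points: []float64{1, 2, 3, 4}}
	norm := normalizeCopy(datamap, foo30)

	// After plan.Run: one cleanup pass reclaims every datamap entry.
	// Callers must NOT also Put norm.points back themselves -- that is
	// exactly the double Put this issue describes.
	reclaimed := 0
	for range datamap {
		reclaimed++ // stand-in for pool.Put(entry.points)
	}
	fmt.Println(len(foo30.points), len(norm.points), reclaimed) // 4 4 1
}
```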

When we added normalization into mergeSeries - in #1674 - we reused the same expr.Normalize functions, but from a different call site, prior to plan.Run(), where the expectations are different: there we still add every series we'll use to the datamap, or, if we don't need a series, put it in the pool. So that behavior no longer makes sense.

Let me dig a bit more into this and figure out a clean way to resolve it.
