
[WIP] Drop collection of old realtime capture requests #15422

Conversation

@agrare (Member) commented Jun 21, 2017

If a realtime capture request is many days old due to stale data on the queue, don't process it. The perf_capture_timer will re-queue historical gap collection and current realtime captures.

We've seen cases where we process realtime capture requests for a 20-minute window that are over a week old, causing the provider to return many days of performance data and #perf_process to time out.

Example:

Capture for ContainerNode start_time: [2017-06-03 21:24:20 UTC], end_time: [2017-06-03 21:54:17 UTC]
expected to get data as of [2017-06-03T21:24:40Z], but got data as of [2017-06-07 09:52:40 UTC].
Processing for ContainerNode for range [2017-06-07 09:52:40 UTC - 2017-06-14 01:42:40 UTC]...
Processing 26930 performance rows...
Message id: [1000031891465], timed out after 600.122446242 seconds.  Timeout threshold [600]

In this case we requested 20 minutes of performance data from 11 days prior, and we actually got about a week's worth, roughly 27,000 rows, where a few hundred rows are typical.

/cc @blomquisg
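
For illustration, a minimal sketch of the kind of guard being proposed (the actual diff isn't shown in this thread; the method name and the hardcoded threshold are hypothetical):

```ruby
# Hypothetical sketch of the proposed guard -- not the actual PR diff.
MAX_REALTIME_AGE = 24 * 60 * 60 # seconds; 1 day, hardcoded for illustration

def stale_capture_request?(start_time)
  # If the request sat on the queue so long that its window is days old,
  # skip it; perf_capture_timer will re-queue gap collection anyway.
  Time.now.utc - start_time > MAX_REALTIME_AGE
end

# The request from the example above, checked in mid-June 2017:
stale_capture_request?(Time.utc(2017, 6, 3, 21, 24, 20)) # => true
```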

@miq-bot (Member) commented Jun 21, 2017

Checked commit agrare@7237eed with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks fine. 🏆

@agrare (Member, Author) commented Jun 21, 2017

Another option is to allow the collection to go through but drop the result if the actual date range doesn't overlap at all with the requested date range. The collection in this case took 1.5 minutes, so dropping it early might be better.
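
A sketch of that alternative (hypothetical helper, not code from this PR): collect first, then discard the result when the returned range shares no overlap with the requested one.

```ruby
# Hypothetical overlap check for the alternative approach described above.
def ranges_overlap?(req_start, req_end, got_start, got_end)
  got_start <= req_end && got_end >= req_start
end

# Using the ranges from the example in the description: requested
# 2017-06-03 21:24..21:54, got 2017-06-07..2017-06-14 -- no overlap,
# so the collected data would be dropped.
ranges_overlap?(
  Time.utc(2017, 6, 3, 21, 24), Time.utc(2017, 6, 3, 21, 54),
  Time.utc(2017, 6, 7, 9, 52),  Time.utc(2017, 6, 14, 1, 42)
) # => false
```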

@Fryguy (Member) commented Jul 5, 2017

@kbrock Please review.

@Fryguy Fryguy requested a review from kbrock July 5, 2017 17:53
@agrare (Member, Author) commented Jul 5, 2017

For the record, the root cause of the huge date range being returned appears to be ManageIQ/manageiq-providers-kubernetes#49

@durandom (Member) commented Jul 6, 2017

The perf_capture_timer will re-queue historical gap collection and current realtime captures.

Does this mean we still get historically correct data for e.g. chargeback?
If so, why would we collect realtime data that is so old at all? Can't we get this threshold from the rollup timer?

@durandom (Member) left a comment


Generally OK with this approach, though I don't know why it should be configurable - see my last comment

@agrare (Member, Author) commented Jul 6, 2017

If so, why would we collect realtime data that is so old at all?

This was the result of having a huge backlog of realtime captures on the queue, so by the time we got to a realtime request it was almost two weeks old.

Can't we get this threshold from the rollup timer?

I'd love not to add new config for this if one already exists; can you point me to where that is?

@durandom (Member) commented Jul 6, 2017

As far as I can tell: https://github.com/ManageIQ/manageiq/blob/master/config/settings.yml#L89
i.e. we roll up a day's worth of metrics once a day.
But maybe @chrisarcand or @kbrock knows more

@agrare (Member, Author) commented Jul 6, 2017

👎 for parsing crontab strings 😆
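
To make the objection concrete: deriving the threshold from the rollup schedule would mean parsing the cron-style string and computing its previous firing, something like the following sketch (using the fugit gem purely as an example parser; the schedule value is hypothetical):

```ruby
# Illustration only: deriving a staleness threshold from a cron schedule.
require "fugit"

schedule = "0 1 * * *" # hypothetical "roll up daily at 01:00" entry
cron     = Fugit.parse_cron(schedule)

# Anything older than the last rollup run could be treated as stale:
threshold = cron.previous_time.to_t
puts "drop realtime requests older than #{threshold}"
```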

@chrisarcand (Member) commented

But maybe @chrisarcand or @kbrock knows more

I'm Jon Snow.

@kbrock (Member) commented Jul 17, 2017

How do we typically collect metrics data for gaps in the past?

I was under the impression that we only collect granular data and then roll it up to coarse data, rather than collecting coarse data directly.

I'd prefer to just hardcode a threshold at 1 day and not collect anything older. (Yes, it may cause issues, but it feels like we should solve metrics rather than letting it stay fluffy and troublesome.)
This of course assumes that we can collect coarse-grained data for older dates.

@agrare (Member, Author) commented Jul 17, 2017

@kbrock normally we split it up into one-day intervals (#14332), but in this case these are old realtime captures that sat on the queue for a long time. Either way we collect the same data, as far as I know.
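
For context, a minimal sketch of that kind of day-interval splitting (illustrative only, not the actual #14332 implementation):

```ruby
# Hypothetical sketch: split a capture range into one-day intervals.
DAY = 24 * 60 * 60 # seconds

def split_into_days(start_time, end_time)
  intervals = []
  t = start_time
  while t < end_time
    intervals << [t, [t + DAY, end_time].min]
    t += DAY
  end
  intervals
end

# split_into_days(Time.utc(2017, 6, 3), Time.utc(2017, 6, 5, 12))
# # => three intervals: 06-03..06-04, 06-04..06-05, 06-05..06-05 12:00
```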

@agrare (Member, Author) commented Aug 4, 2017

Closing since ManageIQ/manageiq-providers-kubernetes#49 fixed the root cause of this issue

@agrare agrare closed this Aug 4, 2017
@agrare agrare deleted the prevent_really_old_realtime_capture_requests branch August 14, 2017 18:51