
[WIP] Drop collection of old realtime capture requests #15422

Conversation

@agrare (Member) commented Jun 21, 2017

If a realtime capture request is many days old due to stale data on the queue, don't process it. The perf_capture_timer will re-queue historical gap collection and current realtime captures.

We've seen cases where we process realtime capture requests for a 20-minute window that are over a week old, causing the provider to return many days of performance data and #perf_process to time out.

Example:

Capture for ContainerNode start_time: [2017-06-03 21:24:20 UTC], end_time: [2017-06-03 21:54:17 UTC]
expected to get data as of [2017-06-03T21:24:40Z], but got data as of [2017-06-07 09:52:40 UTC].
Processing for ContainerNode for range [2017-06-07 09:52:40 UTC - 2017-06-14 01:42:40 UTC]...
Processing 26930 performance rows...
Message id: [1000031891465], timed out after 600.122446242 seconds.  Timeout threshold [600]

In this case we requested 20 minutes of performance data from 11 days prior, and we actually got about a week's worth, roughly 27,000 rows, where a few hundred rows are typical.

/cc @blomquisg
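
For illustration, a minimal sketch of the kind of guard being proposed (the actual diff isn't shown in this thread; the method name and the hardcoded threshold are hypothetical):

```ruby
# Hypothetical sketch of the proposed guard -- not the actual PR diff.
MAX_REALTIME_AGE = 24 * 60 * 60 # seconds; 1 day, hardcoded for illustration

def stale_capture_request?(start_time)
  # If the request sat on the queue so long that its window is days old,
  # skip it; perf_capture_timer will re-queue gap collection anyway.
  Time.now.utc - start_time > MAX_REALTIME_AGE
end

# The request from the example above, checked in mid-June 2017:
stale_capture_request?(Time.utc(2017, 6, 3, 21, 24, 20)) # => true
```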

@miq-bot (Member) commented Jun 21, 2017

Checked commit agrare@7237eed with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks fine. 🏆

@agrare (Member, Author) commented Jun 21, 2017

Another option is to allow the collection to go through but drop the result if the actual date range doesn't overlap at all with the requested date range. The collection in this case took 1.5 minutes, so dropping it early might be better.
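
A sketch of that alternative (hypothetical helper, not code from this PR): collect first, then discard the result when the returned range shares no overlap with the requested one.

```ruby
# Hypothetical overlap check for the alternative approach described above.
def ranges_overlap?(req_start, req_end, got_start, got_end)
  got_start <= req_end && got_end >= req_start
end

# Using the ranges from the example in the description: requested
# 2017-06-03 21:24..21:54, got 2017-06-07..2017-06-14 -- no overlap,
# so the collected data would be dropped.
ranges_overlap?(
  Time.utc(2017, 6, 3, 21, 24), Time.utc(2017, 6, 3, 21, 54),
  Time.utc(2017, 6, 7, 9, 52),  Time.utc(2017, 6, 14, 1, 42)
) # => false
```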

@Fryguy (Member) commented Jul 5, 2017

@kbrock Please review.

@Fryguy Fryguy requested a review from kbrock July 5, 2017 17:53
@agrare (Member, Author) commented Jul 5, 2017

For the record, the root cause of the huge date range being returned appears to be ManageIQ/manageiq-providers-kubernetes#49

@durandom (Member) commented Jul 6, 2017

The perf_capture_timer will re-queue historical gap collection and current realtime captures.

Does this mean we still get historically correct data for e.g. chargeback?
If so, why would we collect realtime data that is so old at all? Can't we get this threshold from the rollup timer?

@durandom (Member) left a comment


Generally OK with this approach, though I don't know why it should be configurable - see my last comment

@agrare (Member, Author) commented Jul 6, 2017

If so, why would we collect realtime data that is so old at all?

This was the result of having a huge backlog of realtime captures on the queue, so by the time we got to a realtime request it was almost two weeks old.

Can't we get this threshold from the rollup timer?

I'd love not to add new config for this if one already exists; can you point me to where that is?

@durandom (Member) commented Jul 6, 2017

As far as I can tell: https://github.com/ManageIQ/manageiq/blob/master/config/settings.yml#L89
i.e. we roll up a day's worth of metrics once a day.
But maybe @chrisarcand or @kbrock knows more

@agrare (Member, Author) commented Jul 6, 2017

👎 for parsing crontab strings 😆
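
To make the objection concrete: deriving the threshold from the rollup schedule would mean parsing the cron-style string and computing its previous firing, something like the following sketch (using the fugit gem purely as an example parser; the schedule value is hypothetical):

```ruby
# Illustration only: deriving a staleness threshold from a cron schedule.
require "fugit"

schedule = "0 1 * * *" # hypothetical "roll up daily at 01:00" entry
cron     = Fugit.parse_cron(schedule)

# Anything older than the last rollup run could be treated as stale:
threshold = cron.previous_time.to_t
puts "drop realtime requests older than #{threshold}"
```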

@chrisarcand (Member) commented

But maybe @chrisarcand or @kbrock knows more

I'm Jon Snow.

@kbrock (Member) commented Jul 17, 2017

How do we typically collect metrics data for gaps in the past?

I was under the impression that we only collect granular data and then roll it up to coarse data, rather than collecting coarse data directly.

I'd prefer to just hardcode a threshold at 1 day and not collect anything older. (Yes, it may cause issues, but it feels like we should solve metrics rather than letting it stay fluffy and troublesome.)
This of course assumes that we can collect coarse-grained data for older dates.

@agrare (Member, Author) commented Jul 17, 2017

@kbrock normally we split it up into one-day intervals (#14332), but in this case these are old realtime captures that sat on the queue for a long time. Either way we collect the same data, as far as I know.
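
For context, a minimal sketch of that kind of day-interval splitting (illustrative only, not the actual #14332 implementation):

```ruby
# Hypothetical sketch: split a capture range into one-day intervals.
DAY = 24 * 60 * 60 # seconds

def split_into_days(start_time, end_time)
  intervals = []
  t = start_time
  while t < end_time
    intervals << [t, [t + DAY, end_time].min]
    t += DAY
  end
  intervals
end

# split_into_days(Time.utc(2017, 6, 3), Time.utc(2017, 6, 5, 12))
# # => three intervals: 06-03..06-04, 06-04..06-05, 06-05..06-05 12:00
```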

@agrare (Member, Author) commented Aug 4, 2017

Closing since ManageIQ/manageiq-providers-kubernetes#49 fixed the root cause of this issue

@agrare agrare closed this Aug 4, 2017
@agrare agrare deleted the prevent_really_old_realtime_capture_requests branch August 14, 2017 18:51