Split metric collections into smaller intervals #14332
Conversation
items << [interval_name]
items[0] << start_time << end_time unless start_time.nil?

start_time = last_perf_capture_on unless start_time
The side effect of this is that it's still possible to call target.perf_capture("realtime") on a target with a very old last_perf_capture_on and end up with the bad behavior of attempting to grab many, many days of metrics in one go. This PR's change only helps if the caller uses the queue to queue up the perf capture.
This leads me to think that we need to refactor the metrics capture system past the perf_capture_queue method to always insist that the start_date and end_date are provided. That way, either you know what you're doing from the caller's standpoint, or you queue up the call and let the queuing mechanism handle the breakdown of date intervals.
In fact, even trying to catch the problem at the perf_capture level leaves the individual provider implementations with the ability to interpret what it means to have a missing start_date or end_date. It feels very dirty allowing each layer of the metrics capture code to make those assumptions.
This code was sort of relying on the fact that on vmware, if you ask for anything older than 4 hours for realtime, you are just going to get the last 4 hours. I think it's ok to split it up, because then you should just get a bunch of no-op collections.
This change only helps if the caller uses the queue to queue up the perf capture.
Is there any place that doesn't do this?
So, are you saying that the provider-specific part of perf_capture should be responsible for determining what to do with the timestamps given to them? If so, I thought that's what the code already does (aside from the gap collection).
One other point to note is that for historical collection, we intentionally collected the hours in reverse (that's the .unshift in the original). This was because historical is done in a lower priority. So, it's possible data could come in more slowly and we wanted the charts to grow backwards instead of growing forward but from a time in the way past (if that makes sense). This code undoes that.
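(For illustration, the ordering difference being described, with made-up interval values: appending grows the list oldest-first, while unshift, the original behavior, keeps the newest interval first so lower-priority historical collection backfills from the present.)

```ruby
# Illustrative only: how << (append) vs. unshift order the collected intervals.
pairs = [[1, 2], [2, 3], [3, 4]] # oldest -> newest

appended  = []
unshifted = []
pairs.each do |pair|
  appended << pair        # => [[1, 2], [2, 3], [3, 4]]  (oldest first)
  unshifted.unshift(pair) # => [[3, 4], [2, 3], [1, 2]]  (newest first)
end
```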
This change only helps if the caller uses the queue to queue up the perf capture.
Is there any place that doesn't do this?
I believe all of the important parts of the code queue today, but there's still some method name building that's hard to track down.
So, are you saying that the provider-specific part of perf_capture should be responsible for determining what to do with the timestamps given to them? If so, I thought that's what the code already does (aside from the gap collection).
No, I'm saying that there are provider-specific implementations that still expect that the caller can send in nil for start/end dates. And, I think the interface should be more strict and assume that the caller always provides a valid start/end date, instead of allowing the assumption of what a nil start/end date means to permeate throughout the metrics capture code. In the end, if no one is calling specific implementations with a nil start/end date, then it's wasted logic. And, if people are calling specific implementations with a nil start/end date, then it's a pattern of deviation that will show up as we consolidate more things into the provider platform arena.
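(A minimal sketch of the stricter interface being argued for here; perf_capture_interval and the ArgumentError handling are assumptions for illustration, not the actual ManageIQ code.)

```ruby
# Hypothetical stricter entry point: callers must always supply both timestamps,
# so no provider-specific layer has to guess what a nil start/end date means.
def perf_capture(interval_name, start_time, end_time)
  raise ArgumentError, "start_time and end_time are required" if start_time.nil? || end_time.nil?

  # Any defaulting (e.g. falling back to last_perf_capture_on) and the splitting
  # into day-sized chunks would live in perf_capture_queue, not here.
  perf_capture_interval(interval_name, start_time.utc, end_time.utc) # assumed helper
end
```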
One other point to note is that for historical collection, we intentionally collected the hours in reverse (that's the .unshift in the original). This was because historical is done in a lower priority. So, it's possible data could come in more slowly and we wanted the charts to grow backwards instead of growing forward but from a time in the way past (if that makes sense). This code undoes that.
D'oh, I'll go over that one more time, then. I think I can just reverse what I've got now with the steps and each_cons bits.
Nevermind...looks like you did. I just misread the code.
intervals = []
start_hour = start_time
while start_hour != end_time
  #TODO: replace 1.day with a configurable threshold defined by provider type
Meh ... follow-up PR if anything.
It's possible that some providers will gather so much data every hour that attempting to even get just one day would be too much data. Dunno.
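(If that follow-up ever happens, it might look roughly like this; capture_interval_threshold is an invented hook name, and the split itself mirrors the step_value version adopted later in this PR.)

```ruby
# Sketch: a per-provider-overridable chunk size instead of a hard-coded 1.day.
def capture_interval_threshold
  1.day # default; a provider subclass could override with a smaller window
end

def split_capture_intervals(interval_name, start_time, end_time)
  (start_time..end_time).step_value(capture_interval_threshold)
                        .each_cons(2)
                        .map { |s_time, e_time| [interval_name, s_time, e_time] }
                        .reverse
end
```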
intervals = []
start_hour = start_time
while start_hour != end_time
  #TODO: replace 1.day with a configurable threshold defined by provider type
Maybe make it a parameter?
Yeah, it's the same logic ... and it's DRYish.
  end_hour = end_time if end_hour > end_time
  intervals << [start_hour, end_hour]
  start_hour = end_hour
end
This feels very manual. I think you can do this with
(start_time..end_time).step_value(1.day)
If you definitely need them in start/end pairs, just tack on an each_cons(2)
(start_time..end_time).step_value(1.day).each_cons(2)
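(For reference, a sketch of how the manual loop collapses once the Range#step_value extension is used; the default threshold argument is an assumption, not the final diff.)

```ruby
# Day-sized [start, end] pairs from a time range via step_value + each_cons.
def split_capture_intervals(start_time, end_time, threshold = 1.day)
  (start_time..end_time).step_value(threshold).each_cons(2).to_a
end

# split_capture_intervals(Time.utc(2017, 1, 1), Time.utc(2017, 1, 3, 12))
# => [[2017-01-01 00:00 UTC, 2017-01-02 00:00 UTC],
#     [2017-01-02 00:00 UTC, 2017-01-03 00:00 UTC],
#     [2017-01-03 00:00 UTC, 2017-01-03 12:00 UTC]]
```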
This also feels very "copy'n'paste" 😄
I can update it to something prettier.
LOL I didn't see the original, which was why I thought you were writing something new. Heh, yeah, now that we have the step_value method, it's a LOT nicer :)
@@ -36,6 +36,16 @@ def expected_stats_period_end
  parse_datetime('2013-08-28T12:41:40Z')
end

context "#perf_capture_queue" do
  it "splits up long perf_capture durations for old last_perf_capture_on" do
    # there shoud be 10 days + a partial day, or 11 queue items
comment typo shoud -> should
def split_capture_intervals(start_time, end_time)
  intervals = []
  start_hour = start_time
  while start_hour != end_time
Is there any validation on the end_time? Looks like this could loop infinitely if start_time and end_time aren't aligned
Yeah, this should definitely be a <. Or just use step_value as mentioned above.
step_value ftw
Force-pushed from 406d016 to 9e06dbe
# [[2017-01-03 12:00:00 UTC, 2017-01-04 12:00:00 UTC],
#  [2017-01-02 12:00:00 UTC, 2017-01-03 12:00:00 UTC],
#  [2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]
(start_time..end_time).step_value(threshold).each_cons(2).collect { |s_time, e_time| [s_time, e_time] }.reverse
Is the collect bit necessary? .each_cons already returns that format, so curious.
When I tried it, each_cons returned a single array:
> (1..20).step_value(2).each_cons(2)
=> #<Enumerator: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 20]:each_cons(2)>
Whereas the collect will group them into the tuples:
> (1..20).step_value(2).each_cons(2).collect {|s, e| [s, e]}
=> [[1, 3], [3, 5], [5, 7], [7, 9], [9, 11], [11, 13], [13, 15], [15, 17], [17, 19], [19, 20]]
And, reverse gets the ordering correct:
> (1..20).step_value(2).each_cons(2).collect {|s, e| [s, e]}.reverse
=> [[19, 20], [17, 19], [15, 17], [13, 15], [11, 13], [9, 11], [7, 9], [5, 7], [3, 5], [1, 3]]
One thing I could do is pass in the interval_name as an argument and forego the extra call to map below. Then, the collect call here really is manipulating the tuples.
I think you read it wrong... The first part is returning an Enumerator over the set, but it hadn't yet applied the each_cons. Calling to_a shows that each_cons by itself works.
> (1..20).step_value(2).each_cons(2).to_a
=> [[1, 3], [3, 5], [5, 7], [7, 9], [9, 11], [11, 13], [13, 15], [15, 17], [17, 19], [19, 20]]
However, TIL you can't call .reverse on an Enumerator directly. Even so, you can call it on .to_a:
> (1..20).step_value(2).each_cons(2).to_a.reverse
=> [[19, 20], [17, 19], [15, 17], [13, 15], [11, 13], [9, 11], [7, 9], [5, 7], [3, 5], [1, 3]]
To your point though, if you pass in interval_name, then you just call map directly on each_cons.
> (1..20).step_value(2).each_cons(2).map { |t1, t2| [interval_name, t1, t2] }.reverse
=> [["realtime", 19, 20],
["realtime", 17, 19],
["realtime", 15, 17],
["realtime", 13, 15],
["realtime", 11, 13],
["realtime", 9, 11],
["realtime", 7, 9],
["realtime", 5, 7],
["realtime", 3, 5],
["realtime", 1, 3]]
Exactly this! 😄
Done.
cb = {:class_name => self.class.name, :instance_id => id, :method_name => :perf_capture_callback, :args => [[task_id]]} if task_id
end

# Determine what items we should be queuing up
items = [interval_name]
items = split_capture_intervals(start_time, end_time).map do |start_and_end_time|
Think this will be cleaner if, instead of a single start_and_end_time, you have two separate st and et variables, maybe?
Done by virtue of removing this line
original_queue_count = MiqQueue.count
@vm.last_perf_capture_on = 10.days.ago - 5.hours - 23.minutes
@vm.perf_capture_queue("realtime")
expect(MiqQueue.count - original_queue_count).to eq 11
It might be good to check some candidate time values and also to see if they are in reverse. You can use Timecop.freeze if you need to ensure specific time values.
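(A hedged sketch of that kind of assertion; it assumes a clean MiqQueue and queue args laid out as [start_time, end_time], matching the spec excerpts below.)

```ruby
# Freeze "now" so the expected interval boundaries are deterministic.
Timecop.freeze(Time.utc(2017, 1, 11, 12)) do
  @vm.last_perf_capture_on = 10.days.ago.utc
  @vm.perf_capture_queue("realtime")

  starts = MiqQueue.order(:id).map { |q| q.args.first }
  expect(starts).to eq(starts.sort.reverse) # newest interval queued first
end
```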
Test added
When a target's `last_perf_capture_on` is far into the past, the next attempt to capture metrics for that target will result in an attempt to gather a potentially huge amount of metrics at one time. This change will break down the intervals between `last_perf_capture_on` and `today` into individual days. This is the same logic that's used when collecting gap metrics. https://bugzilla.redhat.com/show_bug.cgi?id=1429593
Remove a redundant call to `map`, and add a test to make sure that the dates in metrics collection are in a properly reversed order. https://bugzilla.redhat.com/show_bug.cgi?id=1429593
Force-pushed from 9e06dbe to f874607
items[0] << start_time << end_time unless start_time.nil?

start_time = last_perf_capture_on unless start_time
end_time = Time.now.utc unless end_time
cb = {:class_name => self.class.name, :instance_id => id, :method_name => :perf_capture_callback, :args => [[task_id]]} if task_id
Could this cause a problem if we have multiple queue items calling the same callback now?
I can't really understand the code on line 79 below.
NM, I misread.
Lemme look through that callback to see how that's being handled.
context "#perf_capture_queue" do
  it "splits up long perf_capture durations for old last_perf_capture_on" do
    # there should be 10 days + a partial day, or 11 queue items
    Timecop.freeze
Prefer Timecop.freeze with a block. This will auto-return at the end even in the case of an exception.
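(For example, the block form; the frozen time value here is illustrative.)

```ruby
# The clock is restored automatically when the block exits, even on an exception.
Timecop.freeze(Time.utc(2017, 1, 11, 12)) do
  # ... assertions that depend on "now" ...
end
```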
Yeah, I saw that and changed it in another update ... testing that out locally and I can update in a few minutes
it "splits up long perf_capture durations for old last_perf_capture_on" do
  # there should be 10 days + a partial day, or 11 queue items
  Timecop.freeze
  original_queue_count = MiqQueue.count
Alternately, just whack the queue before you start, though I'm very surprised to see queue items already in the queue.
So was I. It's because we queue up stop_event_monitor when we create OpenStack EMSes. And, it just so happens that all this testing is relying on OpenStack mocks.
I thought about whacking it, but didn't wanna impact anything else. I'm sure no one is relying on those existing queue messages, but ¯\_(ツ)_/¯.
queue_items = MiqQueue.peek(:conditions => {:zone       => @vm.my_zone,
                                            :queue_name => @vm.queue_name_for_metrics_collection,
                                            :role       => "ems_metrics_collector"},
                            :limit      => 3)
If you whack the queue items from the beginning, then you don't need to use peek or anything complicated here. Just MiqQueue.order(:id) should be sufficient to get you the items in order so that you can traverse over them.
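(i.e., something like this in the spec, assuming the queue is cleared in setup:)

```ruby
MiqQueue.delete_all               # in setup, so only this spec's items exist
@vm.perf_capture_queue("realtime")
queue_items = MiqQueue.order(:id) # items come back in enqueue order; no peek needed
```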
Good point!
queue_items.each do |q_item|
  q_start_time, q_end_time = q_item.args
  expect(q_start_time).to eq start_time
  expect(q_end_time).to eq end_time
You might want to use be_same_time_as for the start and end times here. It's possible that, since we are serializing, there could be a loss of precision, which would give sporadic test failures.
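(For example, assuming the be_same_time_as matcher available in the ManageIQ spec support:)

```ruby
# Tolerates the precision lost when times round-trip through queue serialization.
expect(q_start_time).to be_same_time_as start_time
expect(q_end_time).to be_same_time_as end_time
```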
Roger
# [interval_name, 2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]
start_time = start_time.utc
end_time = end_time.utc
(start_time..end_time).step_value(threshold).each_cons(2).collect do |s_time, e_time|
Minor, but I think you can collapse out the .utc calls into the range...
(start_time.utc..end_time.utc).step_value(
Yeah, wilco
Ensure that perf_capture_queue works on long intervals and single days.
Checked commits blomquisg/manageiq@5571320~...7bd3090 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
I can't reproduce the same spec failure locally with the same seed. I'll close/reopen and see if that helps. If not, I'll rebase on master and see what I'm up against there.
FWIW, I rebased locally and tried the same seed and couldn't get the same error on
Split metric collections into smaller intervals (cherry picked from commit 660e3a3) https://bugzilla.redhat.com/show_bug.cgi?id=1435278
Euwe backport details:
Previously, if the `last_perf_capture_on` value was very very old, MIQ would attempt to gather all of the metrics from `last_perf_capture_on` until now. ManageIQ#14332 introduced the concept of grouping the historical metrics by days, so the collector worker would not try to gather *ALL* of the metrics at once. This change further restricts the metrics by favoring the more recent date between:
* `last_perf_capture_on`
* `Metric::Capture.historical_start_time`
If `last_perf_capture_on` is too far in the past, `Metric::Capture.historical_start_time` will be a forceful override.
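(A sketch of the capping described in that follow-up; the method name and placement are illustrative, not quoted from the actual commit.)

```ruby
# Never start collection earlier than the configured historical start time,
# even if last_perf_capture_on is much older.
def perf_capture_start_time
  [last_perf_capture_on, Metric::Capture.historical_start_time].compact.max
end
```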
…tion Split metric collections into smaller intervals (cherry picked from commit 660e3a3) https://bugzilla.redhat.com/show_bug.cgi?id=1435278
When a target's last_perf_capture_on is far into the past, the next attempt to capture metrics for that target will result in an attempt to gather a potentially huge amount of metrics at one time. This change will break down the intervals between last_perf_capture_on and today into individual days. This is the same logic that's used when collecting gap metrics.
Links
https://bugzilla.redhat.com/show_bug.cgi?id=1429593
https://bugzilla.redhat.com/show_bug.cgi?id=1432719
Steps for Testing/QA
Set the target's last_perf_capture_on field to some time in the past. The expected result is the MiqQueue should receive N number of queue entries, 1 for each full or partial day since last_perf_capture_on.
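(A rough console walk-through of those steps; the VM lookup and the exact offset are illustrative, mirroring the spec in this PR.)

```ruby
vm = Vm.find_by(:name => "some-vm")                          # any metrics-enabled target
vm.last_perf_capture_on = 10.days.ago - 5.hours - 23.minutes # simulate a stale capture timestamp
before = MiqQueue.count
vm.perf_capture_queue("realtime")
MiqQueue.count - before                                      # expect 11: 10 full days + a partial day
```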
/cc @simon3z