Split metric collections into smaller intervals #14332

Merged: 6 commits, merged Mar 21, 2017
Changes from 5 commits
33 changes: 20 additions & 13 deletions app/models/metric/ci_mixin/capture.rb
@@ -17,6 +17,20 @@ def queue_name_for_metrics_collection
ems.metrics_collector_queue_name
end

def split_capture_intervals(interval_name, start_time, end_time, threshold=1.day)
raise _("Start time must be earlier than End time") if start_time > end_time
# Create an array of ordered pairs from start_time and end_time so that each ordered pair is contained
# within the threshold. Then, reverse it so the newest ordered pair is first:
# start_time = 2017/01/01 12:00:00, end_time = 2017/01/04 12:00:00
# [[interval_name, 2017-01-03 12:00:00 UTC, 2017-01-04 12:00:00 UTC],
# [interval_name, 2017-01-02 12:00:00 UTC, 2017-01-03 12:00:00 UTC],
# [interval_name, 2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]
(start_time.utc..end_time.utc).step_value(threshold).each_cons(2).collect do |s_time, e_time|
[interval_name, s_time, e_time]
end.reverse
end
private :split_capture_intervals
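
For illustration only, here is a minimal standalone sketch of the same splitting behavior in plain Ruby. It avoids the Range#step_value helper used above (which is not part of standard Ruby); the split_intervals name and the seconds-based threshold are assumptions for this sketch, not code from the PR.

```ruby
# Hypothetical helper mirroring split_capture_intervals: break [start_time, end_time)
# into chunks of at most `threshold` seconds, returned newest-first.
def split_intervals(interval_name, start_time, end_time, threshold = 24 * 60 * 60)
  raise ArgumentError, "Start time must be earlier than End time" if start_time > end_time

  s = start_time.getutc
  e = end_time.getutc

  # Build the chunk boundaries: s, s + threshold, ..., e (the final chunk may be partial).
  boundaries = []
  t = s
  while t < e
    boundaries << t
    t += threshold
  end
  boundaries << e unless boundaries.last == e

  # Pair consecutive boundaries into intervals, then reverse so the newest interval is first.
  boundaries.each_cons(2).map { |b_start, b_end| [interval_name, b_start, b_end] }.reverse
end

split_intervals("historical", Time.utc(2017, 1, 1, 12), Time.utc(2017, 1, 4, 12))
# => [["historical", 2017-01-03 12:00:00 UTC, 2017-01-04 12:00:00 UTC],
#     ["historical", 2017-01-02 12:00:00 UTC, 2017-01-03 12:00:00 UTC],
#     ["historical", 2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]
```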

def perf_capture_queue(interval_name, options = {})
start_time = options[:start_time]
end_time = options[:end_time]
@@ -32,27 +46,20 @@ def perf_capture_queue(interval_name, options = {})

log_target = "#{self.class.name} name: [#{name}], id: [#{id}]"

# Determine what items we should be queuing up
items = []
cb = nil
if interval_name == 'historical'
start_time = Metric::Capture.historical_start_time if start_time.nil?
end_time = Time.now.utc if end_time.nil?

start_hour = start_time
while start_hour != end_time
end_hour = start_hour + 1.day
end_hour = end_time if end_hour > end_time
items.unshift([interval_name, start_hour, end_hour])
start_hour = end_hour
end
else
items << [interval_name]
items[0] << start_time << end_time unless start_time.nil?

start_time = last_perf_capture_on unless start_time
blomquisg (Member, Author) commented on Mar 14, 2017:
The side effect of this is that it's still possible to call target.perf_capture("realtime") on a target with a very old last_perf_capture_on and end up with the bad behavior of attempting to grab many, many days of metrics in one go.

This PR's change only helps if the caller uses the queue to queue up the perf capture.

This leads me to think that we need to refactor the metrics capture system beyond the perf_capture_queue method to always insist that the start_date and end_date are provided. That way, either you know what you're doing from the caller's standpoint, or you queue up the call and let the queuing mechanism handle breaking the date range into intervals.

In fact, even trying to catch the problem at the perf_capture level leaves the individual provider implementations with the ability to interpret what it means to have a missing start_date or end_date. It feels very dirty to allow each layer of the metrics capture code to make those assumptions.
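
A rough sketch of the stricter interface being argued for here; the signature and error handling are illustrative assumptions, not code from this PR:

```ruby
# Hypothetical: the provider-facing capture method requires explicit bounds,
# so no layer has to guess what a nil start/end date means.
def perf_capture(interval_name, start_time, end_time)
  if start_time.nil? || end_time.nil?
    raise ArgumentError, "start_time and end_time must be provided"
  end
  raise ArgumentError, "start_time must be earlier than end_time" if start_time > end_time

  # ... provider-specific collection for exactly [start_time, end_time) ...
end
```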

A member replied:
This code was sort of relying on the fact that, on VMware, if you ask for anything older than 4 hours for realtime, you are just going to get the last 4 hours. I think it's ok to split it up, because then you should just get a bunch of no-op collections.

> This change only helps if the caller uses the queue to queue up the perf capture.

Is there any place that doesn't do this?

So, are you saying that the provider-specific part of perf_capture should be responsible for determining what to do with the timestamps given to them? If so, I thought that's what the code already does (aside from the gap collection).

One other point to note is that for historical collection, we intentionally collected the hours in reverse (that's the .unshift in the original). This was because historical is done in a lower priority. So, it's possible data could come in more slowly and we wanted the charts to grow backwards instead of growing forward but from a time in the way past (if that makes sense). This code undoes that.

blomquisg (Member, Author) replied:
> This change only helps if the caller uses the queue to queue up the perf capture.
>
> Is there any place that doesn't do this?

I believe all of the important parts of the code queue today, but there's still some method-name building that's hard to track down.

> So, are you saying that the provider-specific part of perf_capture should be responsible for determining what to do with the timestamps given to them? If so, I thought that's what the code already does (aside from the gap collection).

No, I'm saying that there are provider-specific implementations that still expect that the caller can send in nil for start/end dates, and I think the interface should be stricter and assume that the caller always provides a valid start/end date, instead of letting the assumption of what a nil start/end date means permeate the metrics capture code. In the end, if no one is calling the specific implementations with a nil start/end date, then it's wasted logic. And if people are calling specific implementations with a nil start/end date, then it's a pattern of deviation that will show up as we consolidate more things into the provider platform arena.

blomquisg (Member, Author) replied:
> One other point to note is that for historical collection, we intentionally collected the hours in reverse (that's the .unshift in the original). This was because historical is done in a lower priority. So, it's possible data could come in more slowly and we wanted the charts to grow backwards instead of growing forward but from a time in the way past (if that makes sense). This code undoes that.

D'oh, I'll go over that one more time, then. I think I can just reverse what I've got now with the steps and each_cons bits.
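
As a toy illustration of that idea in plain Ruby (hard-coded boundaries stand in for the ones step_value would produce; times are abbreviated in the comments):

```ruby
boundaries = [Time.utc(2017, 1, 1), Time.utc(2017, 1, 2), Time.utc(2017, 1, 3)]

boundaries.each_cons(2).to_a
# => [[2017-01-01, 2017-01-02], [2017-01-02, 2017-01-03]]   # oldest pair first

boundaries.each_cons(2).to_a.reverse
# => [[2017-01-02, 2017-01-03], [2017-01-01, 2017-01-02]]   # newest pair first, as historical collection wants
```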

end_time = Time.now.utc unless end_time
cb = {:class_name => self.class.name, :instance_id => id, :method_name => :perf_capture_callback, :args => [[task_id]]} if task_id
A member commented:
Could this cause a problem if we have multiple queue items calling the same callback now?

I can't really understand the code on line 79 below.

blomquisg (Member, Author) replied on Mar 17, 2017:

NM, I misread.

Lemme look through that callback to see how that's being handled.

end

# Determine what items we should be queuing up
items = [interval_name]
items = split_capture_intervals(interval_name, start_time, end_time) if start_time

# Queue up the actual items
queue_item = {
:class_name => self.class.name,
42 changes: 42 additions & 0 deletions spec/models/metric/ci_mixin/capture_spec.rb
@@ -36,6 +36,48 @@ def expected_stats_period_end
parse_datetime('2013-08-28T12:41:40Z')
end

context "#perf_capture_queue" do
def test_perf_capture_queue(time_since_last_perf_capture, total_queue_items, verify_queue_items_count)
# There are usually some lingering queue items from creating the provider above. Notably `stop_event_monitor`
MiqQueue.delete_all
Timecop.freeze do
start_time = (Time.now.utc - time_since_last_perf_capture)
@vm.last_perf_capture_on = start_time
@vm.perf_capture_queue("realtime")
expect(MiqQueue.count).to eq total_queue_items

# make sure the queue items are in the correct order
queue_items = MiqQueue.order(:id).limit(verify_queue_items_count)
days_ago = (time_since_last_perf_capture.to_i / 1.day.to_i).days
partial_days = time_since_last_perf_capture - days_ago
interval_start_time = (start_time + days_ago).utc
interval_end_time = (interval_start_time + partial_days).utc
queue_items.each do |q_item|
q_start_time, q_end_time = q_item.args
expect(q_start_time).to be_same_time_as interval_start_time
expect(q_end_time).to be_same_time_as interval_end_time
interval_end_time = interval_start_time
# if the collection threshold is ever parameterized, then this 1.day will have to change
interval_start_time = interval_start_time - 1.day
end
end
end

it "splits up long perf_capture durations for old last_perf_capture_on" do
# test when the last perf capture was many days ago
# 10 full 1-day intervals plus 1 partial interval == 11 total queue items
# verify the last 3 queue items
test_perf_capture_queue(10.days + 5.hours + 23.minutes, 11, 3)
end

it "does not get confused when dealing with a single day" do
# test when the last perf capture was just a few hours ago
# a single partial-day interval == 1 total queue item
# verify the last 1 queue item
test_perf_capture_queue(0.days + 2.hours + 5.minutes, 1, 1)
end
end
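
For reference, a worked trace of the arithmetic the helper above performs for the first example, test_perf_capture_queue(10.days + 5.hours + 23.minutes, 11, 3); the values are derived by hand from the test, not captured from a run:

```ruby
# elapsed      = 10.days + 5.hours + 23.minutes   # time since last_perf_capture_on
# days_ago     = 10.days                          # whole days elapsed
# partial_days = 5.hours + 23.minutes             # leftover partial interval
#
# The three newest queue items (the ones the test verifies), walking backwards:
#   [start_time + 10.days, start_time + 10.days + partial_days]  # the partial chunk ending "now"
#   [start_time +  9.days, start_time + 10.days]
#   [start_time +  8.days, start_time +  9.days]
# ...continuing for 11 items in total: 10 full 1-day chunks + 1 partial chunk.
```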

context "2 collection periods total, end of 1. period has incomplete stat" do
###################################################################################################################
# DESCRIPTION FOR: net_usage_rate_average