Split metric collections into smaller intervals #14332

Merged: 6 commits merged into ManageIQ:master from blomquisg:splitting_gap_collection on Mar 21, 2017

Conversation

blomquisg (Member) commented Mar 14, 2017

When a target's last_perf_capture_on is far into the past, the next attempt to capture metrics for that target will result in an attempt to gather a potentially huge amount of metrics at one time.

This change will break down the intervals between last_perf_capture_on and today into individual days. This is the same logic that's used when collecting gap metrics.
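
A rough sketch of the idea (not the exact PR code; it assumes Range#step_value from more_core_extensions, which comes up in the review below): one long span becomes day-sized start/end pairs, with any trailing partial day as its own shorter pair.

require "active_support/all"
require "more_core_extensions/all"

# Split [start_time, end_time] into day-sized [start, end] pairs; the final
# partial day (if any) becomes a shorter trailing pair.
def split_capture_intervals_sketch(start_time, end_time, threshold = 1.day)
  (start_time.utc..end_time.utc).step_value(threshold).each_cons(2).to_a
end

split_capture_intervals_sketch(Time.utc(2017, 3, 4, 12), Time.utc(2017, 3, 14, 17)).size
# => 11  (10 full days plus one partial day)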

Links

https://bugzilla.redhat.com/show_bug.cgi?id=1429593
https://bugzilla.redhat.com/show_bug.cgi?id=1432719

Steps for Testing/QA

  1. Take any active VM and manually change the last_perf_capture_on field to some time in the past.
  2. Collect metrics for that VM

The expected result is that the MiqQueue should receive N queue entries, one for each full or partial day since last_perf_capture_on.

/cc @simon3z

items << [interval_name]
items[0] << start_time << end_time unless start_time.nil?

start_time = last_perf_capture_on unless start_time
blomquisg (Member, Author) commented Mar 14, 2017

The side effect of this is that it's still possible to call target.perf_capture("realtime") on a target with a very old last_perf_capture_on and end up with the bad behavior of attempting to grab many many days of metrics in one go.

This PR's change only helps if the caller uses the queue to queue up the perf capture.

This leads me to think that we need to refactor the metrics capture system beyond the perf_capture_queue method to always insist that the start_date and end_date are provided. That way, either you know what you're doing from the caller's standpoint, or you queue up the call and let the queuing mechanism handle breaking the date range into intervals.

In fact, even trying to catch the problem at the perf_capture level leaves the individual provider implementations with the ability to interpret what it means to have a missing start_date or end_date. It feels very dirty allowing each layer of the metrics capture code to make those assumptions.
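
For illustration only, a stricter boundary might look like the hypothetical sketch below (perf_capture_strict is not the actual ManageIQ API), where nil dates are rejected at the entry point instead of being interpreted by each provider:

# Hypothetical sketch, not the actual ManageIQ interface: reject nil dates at
# the boundary so provider-specific capture never guesses what a missing date means.
def perf_capture_strict(interval_name, start_time, end_time)
  if start_time.nil? || end_time.nil?
    raise ArgumentError, "start_time and end_time must be provided"
  end

  perf_capture(interval_name, start_time.utc, end_time.utc)
end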

Member:

This code was sort of relying on the fact that on vmware, if you ask for anything older than 4 hours for realtime, you are just going to get the last 4 hours. I think it's ok to split it up, because then you should just get a bunch of no-op collections.

This change only helps if the caller uses the queue to queue up the perf capture.

Is there any place that doesn't do this?

So, are you saying that the provider-specific part of perf_capture should be responsible for determining what to do with the timestamps given to them? If so, I thought that's what the code already does (aside from the gap collection).

One other point to note is that for historical collection, we intentionally collected the hours in reverse (that's the .unshift in the original). This was because historical is done in a lower priority. So, it's possible data could come in more slowly and we wanted the charts to grow backwards instead of growing forward but from a time in the way past (if that makes sense). This code undoes that.
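
To illustrate the ordering point with the pieces this PR ends up using (step_value from more_core_extensions plus each_cons), reversing the day pairs queues the most recent day first, so charts fill in backwards from "now":

start_time = Time.utc(2017, 1, 1, 12)
end_time   = Time.utc(2017, 1, 4, 12)
(start_time..end_time).step_value(1.day).each_cons(2).to_a.reverse
# => [[2017-01-03 12:00:00 UTC, 2017-01-04 12:00:00 UTC],
#     [2017-01-02 12:00:00 UTC, 2017-01-03 12:00:00 UTC],
#     [2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]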

blomquisg (Member, Author):

This change only helps if the caller uses the queue to queue up the perf capture.
Is there any place that doesn't do this?

I believe all of the important parts of the code go through the queue today, but there's still some method-name building that's hard to track down.

So, are you saying that the provider-specific part of perf_capture should be responsible for determining what to do with the timestamps given to them? If so, I thought that's what the code already does (aside from the gap collection).

No, I'm saying that there are provider-specific implementations that still expect that the caller can send in nil for start/end dates. I think the interface should be stricter and assume that the caller always provides a valid start/end date, instead of letting the assumption of what a nil start/end date means permeate throughout the metrics capture code. In the end, if no one is calling specific implementations with nil start/end dates, then it's wasted logic. And if people are calling specific implementations with nil start/end dates, then it's a pattern of deviation that will show up as we consolidate more things into the provider platform arena.

blomquisg (Member, Author):

One other point to note is that for historical collection, we intentionally collected the hours in reverse (that's the .unshift in the original). This was because historical is done in a lower priority. So, it's possible data could come in more slowly and we wanted the charts to grow backwards instead of growing forward but from a time in the way past (if that makes sense). This code undoes that.

D'oh, I'll go over that one more time, then. I think I can just reverse what I've got now with the steps and each_cons bits.

agrare requested a review from jameswnl on March 14, 2017 20:33
Fryguy (Member) commented Mar 14, 2017

If this is the same logic as gap collection, can we then use the same code from gap collection to keep it DRY?

Nevermind...looks like you did. I just misread the code.

intervals = []
start_hour = start_time
while start_hour != end_time
#TODO: replace 1.day with a configurable threshold defined by provider type
blomquisg (Member, Author):

Meh ... follow-up PR if anything.

It's possible that some providers will gather so much data every hour that attempting to even get just one day would be too much data. Dunno.

intervals = []
start_hour = start_time
while start_hour != end_time
#TODO: replace 1.day with a configurable threshold defined by provider type
Fryguy (Member) commented Mar 14, 2017

Maybe make it a parameter?

blomquisg (Member, Author):

If this is the same logic as gap collection, can we then use the same code from gap collection to keep it DRY?

Yeah, it's the same logic ... and it's DRYish. The "historical" interval_name is equivalent to gap collection. And this logic was ripped right from that and pulled into its own method.

end_hour = end_time if end_hour > end_time
intervals << [start_hour, end_hour]
start_hour = end_hour
end
Fryguy (Member) commented Mar 14, 2017

This feels very manual. I think you can do this with

(start_time..end_time).step_value(1.day)

If you definitely need them in start/end pairs, just tack on an each_cons(2)

(start_time..end_time).step_value(1.day).each_cons(2)

blomquisg (Member, Author):

This also feels very "copy'n'paste" 😄

I can update it to something prettier.

Member:

LOL, I didn't see the original, which was why I thought you were writing something new. Heh, yeah, now that we have the step_value method, it's a LOT nicer :)

@@ -36,6 +36,16 @@ def expected_stats_period_end
parse_datetime('2013-08-28T12:41:40Z')
end

context "#perf_capture_queue" do
it "splits up long perf_capture durations for old last_perf_capture_on" do
# there shoud be 10 days + a partial day, or 11 queue items
Member:

comment typo shoud -> should

def split_capture_intervals(start_time, end_time)
intervals = []
start_hour = start_time
while start_hour != end_time
Member:

Is there any validation on the end_time? Looks like this could loop infinitely if start_time and end_time aren't aligned

Member:

Yeah, this should definitely be a <. Or just use step_value as mentioned above.

blomquisg (Member, Author):

step_value ftw

blomquisg force-pushed the splitting_gap_collection branch 2 times, most recently from 406d016 to 9e06dbe on March 14, 2017 21:46
# [[2017-01-03 12:00:00 UTC, 2017-01-04 12:00:00 UTC],
# [2017-01-02 12:00:00 UTC, 2017-01-03 12:00:00 UTC],
# [2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]
(start_time..end_time).step_value(threshold).each_cons(2).collect { |s_time, e_time| [s_time, e_time] }.reverse
Member:

Is the collect bit necessary? .each_cons already returns that format, so curious.

blomquisg (Member, Author):

When I tried it, each_cons returned a single array:

> (1..20).step_value(2).each_cons(2)
=> #<Enumerator: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 20]:each_cons(2)>

Whereas the collect will group them into the tuples:

> (1..20).step_value(2).each_cons(2).collect {|s, e| [s, e]}
=> [[1, 3], [3, 5], [5, 7], [7, 9], [9, 11], [11, 13], [13, 15], [15, 17], [17, 19], [19, 20]]

And, reverse gets the ordering correct:

> (1..20).step_value(2).each_cons(2).collect {|s, e| [s, e]}.reverse
=> [[19, 20], [17, 19], [15, 17], [13, 15], [11, 13], [9, 11], [7, 9], [5, 7], [3, 5], [1, 3]]

blomquisg (Member, Author):

One thing I could do is pass in the interval_name as an argument and forego the extra call to map below.

Then, the collect call here really is manipulating the tuples.

Fryguy (Member) commented Mar 17, 2017

I think you read it wrong... The first part is returning an Enumerator over the set, but it hadn't yet applied the each_cons. Calling to_a shows that each_cons by itself works.

> (1..20).step_value(2).each_cons(2).to_a
=> [[1, 3], [3, 5], [5, 7], [7, 9], [9, 11], [11, 13], [13, 15], [15, 17], [17, 19], [19, 20]]

However, TIL you can't call .reverse on an Enumerator directly. Even so, you can call it on .to_a

> (1..20).step_value(2).each_cons(2).to_a.reverse
=> [[19, 20], [17, 19], [15, 17], [13, 15], [11, 13], [9, 11], [7, 9], [5, 7], [3, 5], [1, 3]]

Fryguy (Member) commented Mar 17, 2017

To your point though, if you pass in interval_name, then you just call map directly on each_cons.

> (1..20).step_value(2).each_cons(2).map { |t1, t2| [interval_name, t1, t2] }.reverse
=> [["realtime", 19, 20],
 ["realtime", 17, 19],
 ["realtime", 15, 17],
 ["realtime", 13, 15],
 ["realtime", 11, 13],
 ["realtime", 9, 11],
 ["realtime", 7, 9],
 ["realtime", 5, 7],
 ["realtime", 3, 5],
 ["realtime", 1, 3]]

blomquisg (Member, Author):

Exactly this! 😄

Done.

cb = {:class_name => self.class.name, :instance_id => id, :method_name => :perf_capture_callback, :args => [[task_id]]} if task_id
end

# Determine what items we should be queuing up
items = [interval_name]
items = split_capture_intervals(start_time, end_time).map do |start_and_end_time|
Member:

Think this will be cleaner if, instead of a single start_and_end_time, you have two separate st and et variables, maybe?

blomquisg (Member, Author):

Done by virtue of removing this line

original_queue_count = MiqQueue.count
@vm.last_perf_capture_on = 10.days.ago - 5.hours - 23.minutes
@vm.perf_capture_queue("realtime")
expect(MiqQueue.count - original_queue_count).to eq 11
Fryguy (Member) commented Mar 17, 2017

It might be good to check some candidate time values and also to see if they are in reverse. You can use Timecop.freeze if you need to ensure specific time values.

blomquisg (Member, Author):

Test added

When a target's `last_perf_capture_on` is far into the past, the next
attempt to capture metrics for that target will result in an attempt to
gather a potentially huge amount of metrics at one time.

This change will break down the intervals between `last_perf_capture_on`
and `today` into individual days.  This is the same logic that's used
when collecting gap metrics.

https://bugzilla.redhat.com/show_bug.cgi?id=1429593
Remove a redundant call to `map`, and add a test to make sure that the
dates in metrics collection are in a properly reversed order.

https://bugzilla.redhat.com/show_bug.cgi?id=1429593
items[0] << start_time << end_time unless start_time.nil?

start_time = last_perf_capture_on unless start_time
end_time = Time.now.utc unless end_time
cb = {:class_name => self.class.name, :instance_id => id, :method_name => :perf_capture_callback, :args => [[task_id]]} if task_id
Member:

Could this cause a problem if we have multiple queue items calling the same callback now?

I can't really understand the code on line 79 below.

blomquisg (Member, Author) commented Mar 17, 2017


NM, I misread.

Lemme look through that callback to see how that's being handled.

context "#perf_capture_queue" do
it "splits up long perf_capture durations for old last_perf_capture_on" do
# there should be 10 days + a partial day, or 11 queue items
Timecop.freeze
Member:

Prefer Timecop.freeze with a block. This will auto-return at the end even in the case of an exception.
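
For example, a sketch of the block form (assuming the timecop gem these specs already use):

# The block form restores the real clock automatically, even if the example raises.
Timecop.freeze(Time.utc(2017, 3, 17, 12)) do
  @vm.last_perf_capture_on = 10.days.ago - 5.hours - 23.minutes
  @vm.perf_capture_queue("realtime")
  # ... assertions against MiqQueue go here ...
end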

blomquisg (Member, Author):

Yeah, I saw that and changed it in another update ... testing that out locally and I can update in a few minutes

it "splits up long perf_capture durations for old last_perf_capture_on" do
# there should be 10 days + a partial day, or 11 queue items
Timecop.freeze
original_queue_count = MiqQueue.count
Member:

Alternately, just whack the queue before you start, though I'm very surprised to see queue items already in the queue.

blomquisg (Member, Author):

So was I. It's because we queue up stop_event_monitor when we create OpenStack EMSes. And, it just so happens that all this testing is relying on OpenStack mocks.

I thought about whacking it, but didn't wanna impact anything else. I'm sure no one is relying on those existing queue messages, but ¯\_(ツ)_/¯.

queue_items = MiqQueue.peek(:conditions => {:zone => @vm.my_zone,
:queue_name => @vm.queue_name_for_metrics_collection,
:role => "ems_metrics_collector"},
:limit => 3)
Member:

If you whack the queue items from the beginning, then you don't need to use peek or anything complicated here. Just MiqQueue.order(:id) should be sufficient to get you the items in order so that you can traverse over them.
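
A sketch of that simplification (assuming the spec clears the queue up front):

# Spec setup: drop any pre-existing queue items (e.g. the OpenStack stop_event_monitor entries).
MiqQueue.delete_all

# After perf_capture_queue runs, read the queued items back in insertion order:
queue_items = MiqQueue.order(:id)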

blomquisg (Member, Author):

Good point!

queue_items.each do |q_item|
q_start_time, q_end_time = q_item.args
expect(q_start_time).to eq start_time
expect(q_end_time).to eq end_time
Member:

You might want to use be_same_time_as for the start and end times here. It's possible that, since we are serializing, there could be a loss of precision, which would give sporadic test failures.
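
For instance (a sketch, assuming ManageIQ's be_same_time_as spec matcher):

# Compares timestamps within serialization precision rather than requiring exact equality.
expect(q_start_time).to be_same_time_as start_time
expect(q_end_time).to be_same_time_as end_time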

blomquisg (Member, Author):

Roger

# [interval_name, 2017-01-01 12:00:00 UTC, 2017-01-02 12:00:00 UTC]]
start_time = start_time.utc
end_time = end_time.utc
(start_time..end_time).step_value(threshold).each_cons(2).collect do |s_time, e_time|
Member:

Minor, but I think you can collapse out the .utc calls into the range...

(start_time.utc..end_time.utc).step_value(

blomquisg (Member, Author):

Yeah, wilco

Ensure that perf_capture_queue works on long intervals and single days.
miq-bot commented Mar 20, 2017

Checked commits blomquisg/manageiq@5571320~...7bd3090 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks good. 🍰

blomquisg (Member, Author):

I can't reproduce the same spec failure locally with the same seed. I'll close/reopen and see if that helps. If not, I'll rebase on master and see what I'm up against there.

blomquisg closed this on Mar 20, 2017
blomquisg reopened this on Mar 20, 2017
blomquisg (Member, Author):

FWIW, I rebased locally and tried the same seed and couldn't get the same error on spec/models/metric_spec.rb:333.

Fryguy merged commit 660e3a3 into ManageIQ:master on Mar 21, 2017
Fryguy added this to the Sprint 57 Ending Mar 27, 2017 milestone on Mar 21, 2017
simaishi pushed a commit that referenced this pull request Mar 23, 2017
simaishi (Contributor):

Euwe backport details:

$ git log -1
commit 41721b42d58bb5a50aee449bbda8d57e71cfe4eb
Author: Jason Frey <fryguy9@gmail.com>
Date:   Tue Mar 21 16:11:54 2017 -0400

    Merge pull request #14332 from blomquisg/splitting_gap_collection
    
    Split metric collections into smaller intervals
    (cherry picked from commit 660e3a36f6d5655e8173065ad14a34ab9ef31207)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1435278

blomquisg added a commit to blomquisg/manageiq that referenced this pull request Apr 4, 2017
Previously, if the `last_perf_capture_on` value was very very old, MIQ
would attempt to gather all of the metrics from `last_perf_capture_on`
until now.

ManageIQ#14332 introduced the concept
of grouping the historical metrics by days, so the collector worker
would not try to gather *ALL* of the metrics at once.

This change further restricts the metrics by favoring the more recent
date between:

 * `last_perf_capture_on`
 * Metric::Capture.historical_start_time

If `last_perf_capture_on` is too far in the past,
`Metric::Capture.historical_start_time` will be a forceful override.
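
A rough sketch of that override (hypothetical code, not the exact follow-up commit):

# Never reach back further than the configured historical start time,
# even when last_perf_capture_on is far in the past.
start_time = [last_perf_capture_on, Metric::Capture.historical_start_time].compact.max
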
agrare pushed a commit to agrare/manageiq that referenced this pull request Apr 19, 2017
Merge pull request #14332 from blomquisg/splitting_gap_collection

Split metric collections into smaller intervals
(cherry picked from commit 660e3a3)

https://bugzilla.redhat.com/show_bug.cgi?id=1435278
blomquisg deleted the splitting_gap_collection branch on November 28, 2017 18:32