Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for derivative() CQ's with empty buckets #3383

Closed
wants to merge 5 commits into from

Conversation

kruckenb
Copy link

As mentioned beginning at #3247 (comment)
Setting recompute intervals using recompute-previous-n results in empty intervals when the interval doesn't overlap every sampling rate window.
The GROUP BY time() in the CQ already buckets on the correct intervals, so recompute-previous-n is unnecessary.
This is a fix against 0.9.2 branch, would be awesome if this could ship as part of 0.9.2

@kruckenb kruckenb changed the title compute continuous queries back to 'recompute-no-older-than' compute continuous queries back to 'recompute-no-older-than', let 'GROUP BY' handle bucketing Jul 18, 2015
@kruckenb kruckenb changed the title compute continuous queries back to 'recompute-no-older-than', let 'GROUP BY' handle bucketing Fix for derivative() CQ's with empty buckets Jul 18, 2015
@kruckenb
Copy link
Author

It looks like the tests are failing due to networking issues?

@otoolep
Copy link
Contributor

otoolep commented Jul 18, 2015

@kruckenb - the builds are failing due to compilation errors. Please run go test ./... on your tree before pushing up code, that way you can check yourself if there are build breaks. Check CONTRIBUTING.md for more details

[collectd] 2015/07/18 14:36:08 collectd UDP closed
--- PASS: TestService_BatchSize (0.01s)
    service_test.go:73: closing service
    service_test.go:73: closing service
    service_test.go:73: closing service
[collectd] 2015/07/18 14:36:08 collectd ReadFromUDP error: read udp 127.0.0.1:56594: use of closed network connection
[collectd] 2015/07/18 14:36:08 collectd UDP closed
--- PASS: TestService_BatchDuration (0.25s)
    service_test.go:138: closing service
PASS
ok      github.com/influxdb/influxdb/services/collectd  0.264s
?       github.com/influxdb/influxdb/services/collectd/test_client  [no test files]
# github.com/influxdb/influxdb/services/continuous_querier
services/continuous_querier/service_test.go:97: s.Config.RecomputePreviousN undefined (type *Config has no field or method RecomputePreviousN)
FAIL    github.com/influxdb/influxdb/services/continuous_querier [build failed]

@kruckenb
Copy link
Author

@otoolep passes tests now.

@pauldix
Copy link
Member

pauldix commented Jul 23, 2015

Not sure if this is right. The point of recompute-previous-n was to have the CQ service recompute intervals that have already passed. This is to deal with the problem of delayed data coming in.

@kruckenb
Copy link
Author

@pauldix This patch doesn't remove computing previous intervals. Previous intervals are still computed back to recompute-no-older-than, and the CQ itself handles the bucketing with it's GROUP BY time() clause.

The problem with recompute-previous-n is that it effectively adds another layer of windowing, and results in some intervals having no results (and being filled in with 0's). For example, if two samples happen to fall into the same recompute-previous-n window, the CQ will only generate a result for the second sample, and the result for the first sample timestamp will be 0. This happens periodically, depending on how the windows for samples and running CQ's line up.

This fixes another issue that is also at the root of #3247: if you run a CQ with a GROUP BY time(30s) against a series with 30-second samples, the result will mostly/always be zero, because every interval (limited by the recompute-previous-n window) will have a single sample and derivative() requires at least 2 samples. If you change the CQ GROUP BY to a longer time-span then you'll start to get some 0 results because of the windowing problem mentioned above. Eliminating the recompute-previous-n with this patch fixes this problem as long as recompute-no-older-than is longer than 2x the sampling period.

I'm not sure I'm explaining it clearly (there's a lot to this), if it would help I can post some before-and-after details.

@pauldix
Copy link
Member

pauldix commented Jul 23, 2015

The original goal with the combination of recompute-no-older-than and recompute-previous-n was to give the user controls over how much gets computed when they have CQs with different group by times.

For instance, if you have a CQ with a group by time of 1h, you'd only want to recompute the interval before it. However, if you have a CQ with a group by time of 30s, you may want to compute the last 10 intervals just to make sure that you pull back lagged data.

Having only a single time interval means that you could end up not recomputing those 1h buckets. Unless every CQ query always forces at least the current period and the previous period to be computed, which would solve that problem.

They could set the recompute-no-older-than to something like 10m which would be their window of lagged data, but CQs with longer group by times would still be ok because we always recompute the previous interval.

That make sense?

@kruckenb
Copy link
Author

@pauldix I think I know how to resolve this so it uses recompute-no-older-than and recompute-previous-n to calculate the full recompute window, but eliminates the intermediate time windows (the key to the fix).

  1. Use pre-patch computation to calculate full recompute window based on recompute-no-older-than and recompute-previous-n values
  2. Run the CQ query once using full time window from Step 1 and rely on the CQ GROUP BY time() clause to compute values for intermediate intervals. This approach will provide enough samples to derivative() to avoid 0-value intervals.

If that approach sounds right I'll code it up.

@pauldix
Copy link
Member

pauldix commented Jul 23, 2015

@kruckenb yeah, I think that sounds right. Would be nice to get some unit tests around #1 to ensure that the number of intervals/length of CQ time is sane given different settings and group by intervals.

@kruckenb
Copy link
Author

Refactored to support both recompute-previous-n and recompute-no-older-than

@pauldix
Copy link
Member

pauldix commented Jul 27, 2015

I think there might be one small issue on the setting of the time if recompute-no-older-than is hit. For instance, say they have recompute-previous-n of 3 and recompute-no-older-than set to 20m. Then say they have a CQ where the group by time is 1h.

In that case the recompute-previous-n would go back 3 hours. Your logic then sets the start time to be 20m previous to now. Which would end up writing bad results for that previous window in some cases.

What we actually want to happen is that in all cases, the start time is rounded down to the previous interval. So in the case of my example, we'd actually always compute the last hour and the current hour.

In the case of a 1m group by time, we'd always compute the previous 3 minutes and the current 3 minutes.

@sebito91
Copy link
Contributor

+1 on this commit!

We're looking to graph rate metrics like packets/sec for network traffic, and using the deriv is one key way to implement this. Having gone back and forth trying to get consistent data in our graphs (using grafana), it's somewhat confusing as to the best approach.

Overall question, does it make more sense to apply a query as below in a CQ or just have it run on its own? Seems like a CQ may lead to more consistent data with the ability to tweak the re-calc knobs, but is that best practice for such a (seemingly common) metric?

SELECT derivative(last(value)) FROM "net_packets_recv" WHERE "host" = 'hostname' AND "interface" = 'bond0' and time > now() - 1h group by time(10s)

I'm curious to try out the patch from @kruckenb for a CQ with the above query to ensure we don't receive 0 values. At this point, whether from GUI or CLI we get periodic 0-values returned if we happen to query between intervals. If I'm in a tight loop accessing the same set of data (say a read query from 100 hosts at the same time), some will get all 0's but some will get full data...

> select derivative(last(value)) from net_packets_recv where host='hostname' and interface='enp129s0f0' and time > now() - 1h group by time(10s)
name: net_packets_recv
----------------------
time            derivative
2015-09-23T15:31:10Z    0

<wait about one/two seconds>

> select derivative(last(value)) from net_packets_recv where host='hostname' and interface='enp129s0f0' and time > now() - 1h group by time(10s)
name: net_packets_recv
----------------------
time            derivative
2015-09-23T15:31:30Z    216.8
2015-09-23T15:31:40Z    216.8
2015-09-23T15:31:50Z    217.9
2015-09-23T15:32:00Z    230.5
...
... 

@sebito91
Copy link
Contributor

Following up yet again, found a post with a tip buried deep within...though it does present its own problems. For derivative functions (likely any other rate calcs), you should append fill(0) to your query:

https://groups.google.com/forum/#!topic/influxdb/YbbWL2KCyVk

This does result in some wonky first-bucket calculation periodically but overall does what you need.

@jsternberg
Copy link
Contributor

Any fix for derivative should also fix derivatives when querying normally. We may revisit this kind of solution later, but I think we're going to try implementing the suggestion in #5943.

Thanks for your contribution. I'm going to close this PR.

@jsternberg jsternberg closed this Mar 16, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants