-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC: add aggregate funciton top and bottom #409 #551
POC: add aggregate funciton top and bottom #409 #551
Conversation
this patch provides aggregate function top and low. these function are useful when executing analytics or summarize queries. Added functions: low(columns..., limit) top(columns..., limit) For instance: `select low(value, 10) from series` will output 10 `value` values with ascending order. `select top(value, 10) from series` will output 10 `value` values with descending order. How to build: go get github.com/ryszard/goskiplist/skiplist go build src/daemon.go Currently, these function support float64 and int64 types. Limitations: * can't handle same value correctly (goskiplist limitation) Misc: * each column should support alias. e.g) low(score as low_score, 10) And, I'm not good at English and Golang. so if you find some wrong points, plz feel free to fix it.
Thanks for the PR @chobie! Here are a few thoughts. I just added comments to #409 about a new way to structure the query. I think using having is cleaner. So there are a few things we'd need here. First would be some tests around it. I'd like to see the three cases I bulleted in #409 to have tests. The nice thing about the having way of doing things is that it works cleanly for when you're doing counts on grouped by columns. Then also have tests for what happens if a top query is run against a non-numerical column. A comment on the implementation. I think it's overkill to use a skiplist. Most people are going to be asking for the top 10 or to 50 at most. The simplest method is to have an array of points, implement sorting methods to sort by the field. Should be easy by following this: https://github.com/influxdb/influxdb/blob/master/src/protocol/protocol_extensions.go#L125-L164 Pick the sort based on if you're doing top or bottom. Have the slice sized the same as the N in top. Then as each new point comes in, check vs the last element in the slice. If it's greater than, replace, resort. Sorts on small slices are probably very efficient. If it can be updated to work like those last two comments I left in #409 with tests added, we'd jump at taking this PR. Thanks again, hope that's not too much of a pain. Or let me know if there's something that you think would be cleaner or better. |
Added test cases and choose sort based implementation. Let me know If I missed something.
That's great. I prefer adding having clause (#183) to this patch.
This would be helpful for finding problems with For now, my blocker was solved by this. I'm not in any rush though. |
btw, What is the better way to do |
reviewing now. On the |
So this doesn't implement the -- say you only have 3 data points total and you run this
select top(10, value) from some_series
--- what happens here?
select top(10, value), host from cpu_idle group by time(1h), host It's not clear to me what the additional column arguments in the top function do. Do those just select columns to return? If so then it should be more like this: select top(10, value), host from cpu_idle That means you'll filter by the top value and get the host. I also just realized that the select top(value, 10), host from cpu_idle |
…or it. top(column, limit) bottom(column, limit)
I've changed top and bottom function arguments. Is my understanding correct? I'll improve this when I back home. |
We need to detect correct column offset when group by clause contains other fields. otherwise we correct wrong field.
previous commit was just my wrong approach. basically don't need to keep whole columns.
updated. probably this will work fine. could you review again when you have a chance? |
That is an awesome feature! Thanks! |
Will the documentation be updated to reflect those new functions? |
It will, I'm on it. On Tue, May 20, 2014 at 10:16 AM, Sebastian Cohnen <notifications@github.com
|
this patch provides aggregate function top and low. these function are useful when executing analytics or summarize queries. Added functions: low(columns..., limit) top(columns..., limit) For instance: `select low(value, 10) from series` will output 10 `value` values with ascending order. `select top(value, 10) from series` will output 10 `value` values with descending order. How to build: go get github.com/ryszard/goskiplist/skiplist go build src/daemon.go Currently, these function support float64 and int64 types. Limitations: * can't handle same value correctly (goskiplist limitation) Misc: * each column should support alias. e.g) low(score as low_score, 10) And, I'm not good at English and Golang. so if you find some wrong points, plz feel free to fix it.
Dear Paul, is there any plan to support my first suggestion?
I have thousands of series and I would like to obtain the top 10 on a specific interval. The mean or sum can be used to rank the series. e.g. select top(10, mean(value)) from cpu.* where time > now() - 10m Currently, I have to run: select mean(value) from cpu.* where time > now() - 10m Please let me know if you have this feature on your roadmap. Best |
It's definitely something we want to get to. However, we'll probably be doing column indexes first: #582. And when we do those the better way to structure things would be to have a series called cpu and have the host as an indexed column. Then the query using #183 and top would work. It'll probably be more than a month until we get to this since there are some other features that are much higher priority at the moment. |
Updated: 20 May 2014
This patch provides aggregate function top and bottom. these function are useful when executing analytics or summarize queries like:
Added functions:
select top(value, 10) from series
will output 10value
values with descending order.select bottom(value, 10) from series
will output 10value
values with ascending order.select top(value, 10), name from series group by name
will output 10value
andname
values with descending order (order key is 1st column. this doesn't support multiple sort).supported column types are int64, float64 and string.
And, I'm not good at English and Golang. so if you find some wrong points, please feel free to comment it.