POC: add aggregate funciton top and bottom #409 #551

chobie · 2014-05-17T18:03:58Z

Updated: 20 May 2014

This patch provides aggregate function top and bottom. these function are useful when executing analytics or summarize queries like:

select top(elapsed_time, 10), url from access group by url;

Added functions:

  top(column, limit)
  bottom(column, limit)

select top(value, 10) from series will output 10 value values with descending order.
select bottom(value, 10) from series will output 10 value values with ascending order.
select top(value, 10), name from series group by name will output 10 value and name values with descending order (order key is 1st column. this doesn't support multiple sort).

supported column types are int64, float64 and string.

And, I'm not good at English and Golang. so if you find some wrong points, please feel free to comment it.

this patch provides aggregate function top and low. these function are useful when executing analytics or summarize queries. Added functions: low(columns..., limit) top(columns..., limit) For instance: `select low(value, 10) from series` will output 10 `value` values with ascending order. `select top(value, 10) from series` will output 10 `value` values with descending order. How to build: go get github.com/ryszard/goskiplist/skiplist go build src/daemon.go Currently, these function support float64 and int64 types. Limitations: * can't handle same value correctly (goskiplist limitation) Misc: * each column should support alias. e.g) low(score as low_score, 10) And, I'm not good at English and Golang. so if you find some wrong points, plz feel free to fix it.

pauldix · 2014-05-17T18:21:57Z

Thanks for the PR @chobie! Here are a few thoughts. I just added comments to #409 about a new way to structure the query. I think using having is cleaner. So there are a few things we'd need here. First would be some tests around it. I'd like to see the three cases I bulleted in #409 to have tests.

The nice thing about the having way of doing things is that it works cleanly for when you're doing counts on grouped by columns.

Then also have tests for what happens if a top query is run against a non-numerical column.

A comment on the implementation. I think it's overkill to use a skiplist. Most people are going to be asking for the top 10 or to 50 at most. The simplest method is to have an array of points, implement sorting methods to sort by the field. Should be easy by following this: https://github.com/influxdb/influxdb/blob/master/src/protocol/protocol_extensions.go#L125-L164

Pick the sort based on if you're doing top or bottom. Have the slice sized the same as the N in top. Then as each new point comes in, check vs the last element in the slice. If it's greater than, replace, resort.

Sorts on small slices are probably very efficient.

If it can be updated to work like those last two comments I left in #409 with tests added, we'd jump at taking this PR. Thanks again, hope that's not too much of a pain. Or let me know if there's something that you think would be cleaner or better.

chobie · 2014-05-18T16:49:58Z

Added test cases and choose sort based implementation. Let me know If I missed something.

select mean(value), host from cpu_idle group by host having top(10, mean)  where time > now() - 10m

That's great. I prefer adding having clause (#183) to this patch.

The other potential problem is that when you're doing a top query over the series, you'd probably want the timestamps of the points that are actually at the top

This would be helpful for finding problems with top. For example: to show slow url with select top(100, elapsed, url, user_id) from access query. we could find problem easily if we have timestamp. but we can find the cause with querying multiple times. so it's trivial thing.

For now, my blocker was solved by this. I'm not in any rush though.

chobie · 2014-05-18T16:53:21Z

btw, What is the better way to do data_test test cases? go test integration takes several minutes.
I've commented out unwanted test cases.

pauldix · 2014-05-18T17:18:52Z

reviewing now. On the data_test thing, the best place is in integration/data_test.go. Then you can run by doing make integration_test only=DataTest which should run quickly.

pauldix · 2014-05-18T18:36:14Z

So this doesn't implement the having portion of the query I take it. What happens when you do these queries?

-- say you only have 3 data points total and you run this
select top(10, value) from some_series

--- what happens here?
select top(10, value), host from cpu_idle group by time(1h), host

It's not clear to me what the additional column arguments in the top function do. Do those just select columns to return? If so then it should be more like this:

select top(10, value), host from cpu_idle

That means you'll filter by the top value and get the host.

I also just realized that the percentile function has the number and column parameters reversed, so top should work like that too to keep things consistent. So:

select top(value, 10), host from cpu_idle

…or it. top(column, limit) bottom(column, limit)

chobie · 2014-05-19T02:13:49Z

I've changed top and bottom function arguments. Is my understanding correct?

I'll improve this when I back home.

We need to detect correct column offset when group by clause contains other fields. otherwise we correct wrong field.

previous commit was just my wrong approach. basically don't need to keep whole columns.

chobie · 2014-05-19T17:41:27Z

updated. probably this will work fine. could you review again when you have a chance?

tisba · 2014-05-20T14:15:16Z

That is an awesome feature! Thanks!

tisba · 2014-05-20T14:16:14Z

Will the documentation be updated to reflect those new functions?

pauldix · 2014-05-20T16:41:17Z

It will, I'm on it.

On Tue, May 20, 2014 at 10:16 AM, Sebastian Cohnen <notifications@github.com

wrote:

Will the documentation be updated to reflect those new functions?

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/551#issuecomment-43631075
.

this patch provides aggregate function top and low. these function are useful when executing analytics or summarize queries. Added functions: low(columns..., limit) top(columns..., limit) For instance: `select low(value, 10) from series` will output 10 `value` values with ascending order. `select top(value, 10) from series` will output 10 `value` values with descending order. How to build: go get github.com/ryszard/goskiplist/skiplist go build src/daemon.go Currently, these function support float64 and int64 types. Limitations: * can't handle same value correctly (goskiplist limitation) Misc: * each column should support alias. e.g) low(score as low_score, 10) And, I'm not good at English and Golang. so if you find some wrong points, plz feel free to fix it.

samuraraujo · 2014-05-28T14:05:31Z

Dear Paul, is there any plan to support my first suggestion?

" select top(10, value) from cpu.* where time > now() - 10m"
"I would expect It to retrieve the top 10 series."

I have thousands of series and I would like to obtain the top 10 on a specific interval. The mean or sum can be used to rank the series. e.g. select top(10, mean(value)) from cpu.* where time > now() - 10m

Currently, I have to run: select mean(value) from cpu.* where time > now() - 10m
Then, it retrieves hundred thousands of series that are sorted inside my application. This is not efficient.

Please let me know if you have this feature on your roadmap.

Best

pauldix · 2014-05-28T14:21:06Z

It's definitely something we want to get to. However, we'll probably be doing column indexes first: #582. And when we do those the better way to structure things would be to have a series called cpu and have the host as an indexed column. Then the query using #183 and top would work.

It'll probably be more than a month until we get to this since there are some other features that are much higher priority at the moment.

chobie added 5 commits May 18, 2014 11:32

change top and low arguments order

80b8e98

change low aggregate function name to bottom

b4264d7

use protocol.PointsCollection instead goskiplist

4987717

prevent panic and fix too many columnorders

a3cb0cb

add top and bottom test cases

f242e0c

chobie changed the title ~~POC: add aggregate funciton top and low #409~~ POC: add aggregate funciton top and bottom #409 May 18, 2014

change top and bottom function arguments and add group by test case f…

796630d

…or it. top(column, limit) bottom(column, limit)

chobie added 3 commits May 19, 2014 22:36

remove needless statements

c4acbd2

Fix column index

dcd34cb

We need to detect correct column offset when group by clause contains other fields. otherwise we correct wrong field.

Should use GetValue function instead of my implementation.

f198cdd

previous commit was just my wrong approach. basically don't need to keep whole columns.

jvshahid closed this in d382ea0 May 19, 2014

jvshahid added this to the 0.6.5 milestone May 19, 2014

pauldix mentioned this pull request May 28, 2014

Add support to select the top value from a collection of series #586

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC: add aggregate funciton top and bottom #409 #551

POC: add aggregate funciton top and bottom #409 #551

chobie commented May 17, 2014

pauldix commented May 17, 2014

chobie commented May 18, 2014

chobie commented May 18, 2014

pauldix commented May 18, 2014

pauldix commented May 18, 2014

chobie commented May 19, 2014

chobie commented May 19, 2014

tisba commented May 20, 2014

tisba commented May 20, 2014

pauldix commented May 20, 2014

samuraraujo commented May 28, 2014

pauldix commented May 28, 2014

POC: add aggregate funciton top and bottom #409 #551

POC: add aggregate funciton top and bottom #409 #551

Conversation

chobie commented May 17, 2014

Added functions:

pauldix commented May 17, 2014

chobie commented May 18, 2014

chobie commented May 18, 2014

pauldix commented May 18, 2014

pauldix commented May 18, 2014

chobie commented May 19, 2014

chobie commented May 19, 2014

tisba commented May 20, 2014

tisba commented May 20, 2014

pauldix commented May 20, 2014

samuraraujo commented May 28, 2014

pauldix commented May 28, 2014