Feature: GROUP BY <field> #7200

Closed
ProTip opened this issue Aug 24, 2016 · 8 comments
Labels
1.x, area/influxql (Issues related to InfluxQL query language), area/queries, difficulty/high (This issue needs to be broken down into smaller units of work.), flux/triaged, kind/feature-request, wontfix

Comments

@ProTip

ProTip commented Aug 24, 2016

I would like to propose the ability to GROUP BY fields.

Storing high-cardinality data in tags currently blows up the database for a lot of people. Even with the new index proposal, it could cause significant paging and performance degradation. Further, the domain may be unbounded and more efficiently stored and processed as field data.

I propose that InfluxDB support grouping data on field values during processing. Even with very high cardinality, an unbounded domain would be bounded by the query's time window or time buckets; stream processing techniques should handle this nicely.

Use Case

Analyzing access logs that include the request path. The request path may include IDs and be technically unbounded. A person may wish to look at an hour's worth of requests grouped by time(1m), request_path to see the top requested paths per minute. Total request path cardinality would not exceed one hour's worth of points for a SELECT, or one minute's worth of points for a SELECT INTO. Backfilling this data with the path as a tag currently crashes InfluxDB.
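The bounded-memory argument above can be sketched in a few lines of Python. This is only an illustration of the idea, not InfluxDB code: the function name `count_by_bucket_and_field`, the `(timestamp, request_path)` tuple layout, and the assumption that points arrive in ascending time order with integer second timestamps are all hypothetical. It keeps counters for one bucket at a time, so memory is bounded by the number of distinct field values inside a single bucket.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple

BUCKET_SECONDS = 60  # plays the role of time(1m) in the use case


def count_by_bucket_and_field(
    points: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, str, int]]:
    """Yield (bucket_start, request_path, count) for time-ordered points.

    Assumes timestamps are integer seconds in ascending order. Only the
    counters for the bucket currently being read are held in memory.
    """
    current_bucket: Optional[int] = None
    counts: Counter = Counter()
    for timestamp, request_path in points:
        bucket = timestamp - timestamp % BUCKET_SECONDS
        if current_bucket is not None and bucket != current_bucket:
            # The previous bucket is complete: emit its groups, then
            # drop its counters before starting the next bucket.
            for value, n in sorted(counts.items()):
                yield current_bucket, value, n
            counts.clear()
        current_bucket = bucket
        counts[request_path] += 1
    # Emit whatever is left in the final bucket.
    for value, n in sorted(counts.items()):
        yield current_bucket, value, n


# Hypothetical sample: five requests spread over three one-minute buckets.
sample = [(0, "/a"), (10, "/b"), (59, "/a"), (60, "/a"), (125, "/c")]
rows = list(count_by_bucket_and_field(sample))
```

The sketch depends on the input being sorted by time. If points can arrive out of order, the per-bucket counters would have to be kept until the bucket is known to be closed.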

@jsternberg
Contributor

I don't think this is possible just by the nature of how grouping works. There isn't a way to efficiently group by fields within the query engine without storing all of the returned points in memory and performing a massive amount of sorting. I think the underlying problem is the database index for tags and we are currently working on that in #7151.

@jwilder do you think this should be closed in favor of #7151?

@jwilder
Contributor

jwilder commented Aug 25, 2016

@jsternberg I don't see how #7151 is related to this. #7151 is about removing the in-memory index off of the heap.

@jsternberg
Contributor

The feature request mentioned how the current in-memory index causes problems and makes high-cardinality data impossible to use. #7151 is for making high-cardinality data perform better, so I figured that it would invalidate the need for this. I didn't notice that it also mentioned the new proposal, but I'm not sure it really matters since I don't know if we would be able to efficiently group by field data.

@jwilder added the area/queries, kind/feature-request, and area/influxql labels Aug 25, 2016
@wladekb

wladekb commented Sep 16, 2016

This would also help us in a slightly different use case. We load raw pageload performance data along with metadata into a short-lived measurement. We then have a set of continuous queries that aggregate the data along different axes, e.g. one that generates a single aggregate, another that groups by country, and so on.

In the current InfluxDB version we need to know in advance which columns may be used for grouping so that they are loaded as tags. Moreover, starting to use a new column as a dimension requires us to modify the load process so that the column is written as a tag. At the same time we want to keep the number of tags low to avoid generating an enormous number of series.

I could roughly summarize this use case as a mini-Hadoop, but with better response time and flexibility.
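To make the request concrete, here is a minimal Python sketch of aggregating the same raw rows along dimensions chosen at query time rather than at load time. It is not how InfluxDB or its continuous queries work; `mean_by`, the row layout, and the column names `country`, `browser`, and `load_ms` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, Iterable, List, Mapping, Sequence, Tuple


def mean_by(
    rows: Iterable[Mapping[str, Any]],
    value_key: str,
    group_keys: Sequence[str],
) -> Dict[Tuple[Any, ...], float]:
    """Average rows[value_key] per distinct combination of group_keys.

    An empty group_keys sequence yields one overall aggregate under the
    key (), which mirrors the "single aggregate" query in the comment.
    """
    groups: Dict[Tuple[Any, ...], List[float]] = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        groups[key].append(row[value_key])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical raw pageload rows; no column is special at load time.
pageloads = [
    {"country": "US", "browser": "firefox", "load_ms": 100.0},
    {"country": "US", "browser": "chrome", "load_ms": 300.0},
    {"country": "DE", "browser": "firefox", "load_ms": 500.0},
]

overall = mean_by(pageloads, "load_ms", ())
per_country = mean_by(pageloads, "load_ms", ("country",))
per_browser = mean_by(pageloads, "load_ms", ("browser",))
```

Adding a new grouping dimension here only changes the `group_keys` argument. The cost is that every distinct key combination needs its own accumulator in memory while the query runs.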

@ryanmills

ryanmills commented Dec 7, 2016

Given that we can't update tags, +1 on this so that we can GROUP BY fields at a later stage as the schema changes

@nathanielc added the difficulty/high label Oct 1, 2018
@nathanielc
Contributor

Added difficulty/high because the current mechanics around grouping leverage the index/cursors and there are no mechanisms to group by anything else. Flux is already capable of grouping by fields and time.

@dgnorton dgnorton added the 1.x label Jan 7, 2019
@stale

stale bot commented Jul 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 23, 2019
@stale

stale bot commented Jul 31, 2019

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

@stale stale bot closed this as completed Jul 31, 2019
7 participants