Add high cardinality requirements doc #7175

Closed
wants to merge 2 commits into from

Conversation

@jwilder (Contributor) commented Aug 17, 2016

This is a start at a requirements doc for #7151.

cc @benbjohnson @e-dard @pauldix

@jwilder jwilder added this to the 1.1.0 milestone Aug 17, 2016
@benbjohnson (Contributor)

👍

### Performance

1. The index must be able to support 1B+ series without exhausting RAM
2. Startup times must not exceed 5 mins
Member:

I'd prefer this to be even lower.

Contributor Author:

I agree and I think we should aim to have it lower. I used 5 mins because with one restart in a year, that would be the amount of downtime allowed for five 9s.
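
(For reference: a year is roughly 525,960 minutes, so five 9s of availability allows about 525,960 × (1 − 0.99999) ≈ 5.3 minutes of downtime per year; a single 5-minute restart would consume nearly all of it.)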

Contributor Author:

Updated this to 1min

@pauldix (Member) commented Aug 18, 2016

Should also be able to show tag keys by measurement and be able to show tag values by key and measurement.

@e-dard (Contributor) commented Aug 18, 2016

In terms of performance, we should probably have a reference machine/architecture that we expect to meet the targets on.

@pauldix (Member) commented Aug 18, 2016

The perf targets are tricky. It might be easier to quantify things other than the length of time for startup or query planning.

Like, query planning shouldn't require sorting anything (i.e. things should already be in sorted order). Or, if planning a query against a section of the index that is cold, it shouldn't require more than N disk seeks (for some value of N that makes sense).

Likewise, startup shouldn't require more than opening the file handles for TSM and index files and reading the WAL to load up the in-memory structures. We could then also say that the WAL should be no larger than M bytes, which would be the thing that impacts startup time.

@jwilder (Contributor Author) commented Aug 18, 2016

The performance targets are high-level requirements that would meet a user's needs. They should be verifiable through testing to determine whether we are meeting them or not. I think latency targets for startup make sense because that directly affects the end user (see #6250). I'd rather not have them be too specific to how the index is implemented, as this document will drive design ideas.

@e-dard (Contributor) commented Aug 18, 2016

There isn't anything in this document around renaming measurements, tags or values. I know it's not relevant to the problem we're trying to solve, but it might be worth being sympathetic to making this possible in the future when thinking about implementation. Just a thought.

@jwilder (Contributor Author) commented Aug 18, 2016

@e-dard That's a good point, but I think that could broaden the scope too much. Renaming also affects the TSM file indexes.

6. `SELECT count(value) FROM cpu where host ='server-01' AND location = 'us-east1' GROUP BY host`
7. `DROP MEASUREMENT cpu`
8. `DROP SERIES cpu WHERE time > now() - 1h`
9. `DROP SERIES cpu WHERE host = 'server-01'`
@e-dard (Contributor) Aug 19, 2016

Can we add some other queries here, which have suffered from performance problems in the past?

The execution time for the queries below, for example, should be fast and should degrade gracefully with the amount of data we store. Whether we can achieve O(1) or something worse like ~O(n log n) will probably depend on the TSI implementation.

`SELECT first(value) FROM cpu`
`SELECT value FROM cpu ORDER BY ASC LIMIT 1`
`SELECT last(value) FROM cpu`
`SELECT value FROM cpu ORDER BY DESC LIMIT 1`

Contributor Author:

These queries are similar to the ones already listed in how they would interact with the index. I was trying to come up with scenarios where the index would be used and stressed in different ways.

For example, `SELECT first(value) FROM cpu` needs to access the index in essentially the same way as `DROP MEASUREMENT cpu`: the index would need to be queried to determine all series keys for `cpu`, and then those series would be processed.

The `first`/`last` vs. `ORDER BY DESC` distinction is more of a query engine thing than an index issue, because all four queries would hit the index to return all series for `cpu` and then let the query engine figure out the first, last, etc.

Looking at the current ones, I think a regex scenario is missing and should be added, as well as different boolean logic for tags (as opposed to just AND), which would stress how we merge series sets in the index.
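
To make the series-set point concrete, here's a rough sketch (illustrative only, not the actual index code, and assuming series are identified by sorted `uint64` IDs): AND between tag predicates becomes an intersection of the matching series sets, OR becomes a union.

```go
package main

import "fmt"

// intersect returns IDs present in both sorted slices (predicate1 AND predicate2).
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// union returns IDs present in either sorted slice (predicate1 OR predicate2).
func union(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) || j < len(b) {
		switch {
		case j >= len(b) || (i < len(a) && a[i] < b[j]):
			out = append(out, a[i])
			i++
		case i >= len(a) || a[i] > b[j]:
			out = append(out, b[j])
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	host := []uint64{1, 3, 5, 9}     // series where host = 'server-01'
	region := []uint64{3, 4, 5, 100} // series where location = 'us-east1'
	fmt.Println(intersect(host, region)) // AND -> [3 5]
	fmt.Println(union(host, region))     // OR  -> [1 3 4 5 9 100]
}
```

In practice the index would presumably keep these sets in something more compact than plain slices (e.g. compressed bitmaps), but the AND/OR shape of the merge is the same.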

Contributor:

> For example, `SELECT first(value) FROM cpu` needs to access the index in essentially the same way as `DROP MEASUREMENT cpu`: the index would need to be queried to determine all series keys for `cpu`, and then those series would be processed.

Without meaning to jump too far into implementation details in the requirements doc, would it not be possible to maintain the first/last value for value within the index, so we don't need to scan any series keys at all? I guess it's hard to go down that path and still provide a drop-in replacement for the current index.
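
As a rough illustration of that idea (hypothetical structures, not a proposal for the actual index), a measurement entry could cache the earliest and latest point seen and update it on write:

```go
package main

import (
	"fmt"
	"time"
)

type measurementEntry struct {
	name       string
	firstTime  time.Time
	firstValue float64
	lastTime   time.Time
	lastValue  float64
}

// observe updates the cached first/last point as writes arrive.
func (m *measurementEntry) observe(t time.Time, v float64) {
	if m.firstTime.IsZero() || t.Before(m.firstTime) {
		m.firstTime, m.firstValue = t, v
	}
	if t.After(m.lastTime) {
		m.lastTime, m.lastValue = t, v
	}
}

func main() {
	e := &measurementEntry{name: "cpu"}
	e.observe(time.Unix(100, 0), 0.5)
	e.observe(time.Unix(50, 0), 0.9)
	fmt.Println(e.firstValue, e.lastValue) // 0.9 0.5
}
```

One obvious complication is that deletes would invalidate the cached values, which is part of what makes it hard to keep such a scheme as a drop-in replacement.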

@pauldix (Member) commented Aug 19, 2016

One thing we may want to add later to the query language is a "starts with" type of query for measurement names and tag values. The only matching we have now is against regex, which is very inefficient since it has to scan all possible values.

Starts with is quite useful for doing auto-completion inside of a UI and those can be optimized much better than regex matches.
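
For illustration, here's a sketch of why prefix matching is cheap when the values are kept sorted (a hypothetical helper, not existing code): a binary search finds the first candidate and the scan stops at the first non-match, instead of testing a regex against every value.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// valuesWithPrefix returns all values in the sorted slice that start with prefix.
func valuesWithPrefix(sorted []string, prefix string) []string {
	// Binary search for the first value >= prefix.
	start := sort.SearchStrings(sorted, prefix)
	var out []string
	for _, v := range sorted[start:] {
		if !strings.HasPrefix(v, prefix) {
			break // sorted order means no later value can match
		}
		out = append(out, v)
	}
	return out
}

func main() {
	hosts := []string{"db-01", "web-01", "web-02", "web-10", "worker-01"}
	fmt.Println(valuesWithPrefix(hosts, "web-")) // [web-01 web-02 web-10]
}
```

With the values sorted, the work is proportional to the number of matches rather than the total number of values.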

@benbjohnson (Contributor)

We could also rewrite simple regular expressions with a trailing `.*` to STARTS WITH queries automatically.
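
A minimal sketch of that rewrite (a hypothetical planner helper, only handling the simple anchored `^literal.*` form):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rewriteToPrefix reports whether pattern has the simple form ^literal.* and,
// if so, returns the literal so the query could be served as STARTS WITH.
func rewriteToPrefix(pattern string) (prefix string, ok bool) {
	if !strings.HasPrefix(pattern, "^") || !strings.HasSuffix(pattern, ".*") {
		return "", false
	}
	literal := strings.TrimSuffix(strings.TrimPrefix(pattern, "^"), ".*")
	// Reject anything that still contains regex metacharacters.
	if literal == "" || literal != regexp.QuoteMeta(literal) {
		return "", false
	}
	return literal, true
}

func main() {
	fmt.Println(rewriteToPrefix(`^web-.*`))      // web- true
	fmt.Println(rewriteToPrefix(`^(web|db)-.*`)) // "" false -> still needs a regex scan
}
```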

@e-dard (Contributor) commented Aug 23, 2016

LGTM 👍

@pauldix (Member) commented Aug 24, 2016

One other thing I think we might want to add to this: for finding series, measurements, tag keys, and tag values, we may want some method for returning a list scoped to a rough period of time.

For example, if a user keeps their data around for a long time and they have a bunch of hosts or docker container IDs that are old, often they'll probably only want to return the set of items that have been written to in the last 24 hours.

Or if they're going back in time they wouldn't want to see everything for all time, just the relevant entries for that time range.

I don't think it needs to be accurate down to the second. Even 24-hour granularity would probably be good enough, just to narrow the space of items that show up. If accuracy is needed, the underlying series could be queried to see if they have data in that range.

What do you guys think?
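
As a sketch of one way this could work (hypothetical structures, not a design proposal): keep a coarse "last written" bucket per series, say truncated to the day, and filter listings against it instead of consulting the underlying data.

```go
package main

import (
	"fmt"
	"time"
)

type seriesMeta struct {
	key         string
	lastWritten time.Time // coarse bucket, e.g. truncated to the day
}

// markWrite updates the coarse last-written bucket for a series key.
func markWrite(index map[string]*seriesMeta, key string, t time.Time) {
	day := t.Truncate(24 * time.Hour)
	m, ok := index[key]
	if !ok {
		m = &seriesMeta{key: key}
		index[key] = m
	}
	if day.After(m.lastWritten) {
		m.lastWritten = day
	}
}

// seriesSince lists series whose coarse last-written bucket is no older than since.
func seriesSince(index map[string]*seriesMeta, since time.Time) []string {
	var out []string
	for _, m := range index {
		if !m.lastWritten.Before(since.Truncate(24 * time.Hour)) {
			out = append(out, m.key)
		}
	}
	return out
}

func main() {
	idx := map[string]*seriesMeta{}
	now := time.Now()
	markWrite(idx, "cpu,host=server-01", now)
	markWrite(idx, "cpu,host=old-host", now.AddDate(0, -6, 0)) // written six months ago
	fmt.Println(seriesSince(idx, now.Add(-24*time.Hour)))      // only the recently written series
}
```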

@jwilder jwilder modified the milestones: 1.2.0, 1.1.0 Oct 6, 2016
@timhallinflux timhallinflux modified the milestones: 1.3.0, 1.2.0 Dec 19, 2016
@rbetts (Contributor) commented May 30, 2017

No further action specific to the 1.3 milestone. Closing this issue. We will track ongoing TSI work separately.

@rbetts rbetts closed this May 30, 2017
@jwilder jwilder deleted the jw-cardinality branch April 20, 2018 15:15