Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should not allow field and tag with same key #2615

Closed
beckettsean opened this issue May 19, 2015 · 18 comments
Closed

should not allow field and tag with same key #2615

beckettsean opened this issue May 19, 2015 · 18 comments

Comments

@beckettsean
Copy link
Contributor

We should raise an error if a field key has the same name as a tag key for a given measurement. Otherwise the behavior is unpredictable.

Inspiried by https://groups.google.com/d/msgid/influxdb/feca028b-7c62-439c-9d17-97b58edcee92%40googlegroups.com

@corylanou
Copy link
Contributor

I need more context on how we want to fix this. There are several challenges for this depending on what the fix is.

First, let's examine the write endpoint to solve the problem.

If we check each point being written for duplicate tag/fields that will be fairly easy. While adding a little overhead, it should be fairly quick. Use of benchmarking could easily tell us what the trade off is.

However, it's important to understand that enforcing the write endpoint BEYOND the single point write is VERY costly, as you could have this scenario:

measurement,foo=bar value=12
measurement,value=bar foo=12

Because we can write any tag/field name on any insert, we would have to look up all the meta data for that set of keys/fields before accepting the write. Now, in the case where we are in the same shard (time index), this is less overhead, but in the case of truly making this correct, it would need to check EVERY shard that the measurement exists in across the cluster.

For the query endpoint, we have the same problem. To know if you have a duplicate field/tag value prior to issuing the query, we would have to check every shard where the time index of your query will hit. To be truly accurate, you would have to check every shard, otherwise sometimes the query may work, and sometimes not, as it may not run into a collision in one time range, but could in another.

So the only feasible fix imo is on the single point write, but beyond that the cost seems way to high.

/cc @pauldix @otoolep @beckettsean

@otoolep
Copy link
Contributor

otoolep commented Nov 12, 2015

Agreed, this isn't really fixable for the reasons you outlined @corylanou -- there is no central store of tag and field metadata in our system.

@pauldix
Copy link
Member

pauldix commented Nov 12, 2015

@corylanou agree, just disallow it if they're trying to do that in the same point.

@beckettsean
Copy link
Contributor Author

@pauldix @corylanou @otoolep

Fixing it only for single point duplicates does very little to address the problem. It is still very easy for users to render their database unusable. If we can't prevent duplicate field and tag names, and that does seem to be a very hard challenge, then we must fix one or more of the following:

implement RENAME FIELD - #4156
OR
implement RENAME TAG - #4157

OR

make it possible to query tag and field of the same name - #4630
AND
make it possible to group by tag with field of the same name - #4217
AND
make it possible to use the tag or the field in the WHERE clause, explicitly declaring which is intended. (No issue filed yet)

So, either we make it possible to rename a tag or field, so that we can undo a bad situation, or we make it possible to explicitly declare in the query language which column to use when querying the duplicate names, so we can work around a bad situation.

Otherwise it is possible to make a single write to the database that causes huge chunks of data to become un-queryable forever, with no workaround or mitigation. That is not acceptable behavior for a database, regardless of the performance or architectural concerns involved.

@otoolep
Copy link
Contributor

otoolep commented Nov 12, 2015

There is a third option -- allow users to specify up front what the schema is.

#3006

@corylanou corylanou modified the milestones: 0.9.5, 0.9.4 Nov 16, 2015
@corylanou
Copy link
Contributor

I vote for what @beckettsean said about supporting RENAME FIELD and RENAME TAG. I don't think there is any real value in dup checking a single line for field/tag names. It adds unnecessary overhead to a very critical performance path and fixes only the furthest of edge case that we are really trying to solve for.

@pauldix
Copy link
Member

pauldix commented Nov 18, 2015

rename isn't going to be tractable because those names existing in the underlying TSM files. It would necessitate rewriting every file across the entire database that has a reference to that name. We're unlikely to support renaming features for a long time.

Using INTO queries they could potentially query the old data and write into a new series, but that could be tricky too.

@beckettsean
Copy link
Contributor Author

@pauldix I understand that none of the solutions are easy, but we have to do something that either prevents the issue from happening or allows for recovery after the fact. Otherwise it's a one way ticket to data loss.

I like @gunnaraasen's proposal in #4823, and that seems more tractable than either preventing duplicates or renaming them.

@pauldix
Copy link
Member

pauldix commented Nov 18, 2015

@beckettsean yeah, I think that's the way to go. I'm going to close this out in favor of giving the user explicit controls at query time.

@pauldix pauldix closed this as completed Nov 18, 2015
@jsternberg
Copy link
Contributor

I should be able to modify the query engine to handle queries across fields and tags so that it looks correct, but the only requirement would be to prevent a single point from containing a duplicate. I can make this work:

cpu,host=server01 value=2
cpu host="server02",value=3

I don't think I can make this work:

cpu,host=server01 host="server01",value=2

Should I reopen this ticket to prevent points from doing the latter?

@corylanou
Copy link
Contributor

Since you can't prevent it in all cases, I think the conclusion above is still the correct one, give them better query controls. It's a very poor design choice for the end user to have both tags and fields with the same name imo, but if we give them a way to query it then it's up to them.

@pauldix
Copy link
Member

pauldix commented Mar 16, 2016

@corylanou yep agree. As long as it's queryable later, should be fine. Give users a length of rope and a knife to cut it ;)

@shakefu
Copy link

shakefu commented Jul 28, 2018

Is there any way to prevent this (perhaps a database configuration option) now on Influx 1.5.x or 1.6.x? We just had an issue with this that caused several hours of data loss.

Edit: The SHOW FIELD KEYS and SHOW TAG KEYS queries seem to indicate that there are now central metadata stores for the data points which could be used.

@gunnaraasen
Copy link
Member

@shakefu InfluxDB will accept overlapping tag and field names for writes so this issue will not cause data loss. The tag and field name overlap only affects queries. You'll need to use the identifier::[tag|field] syntax described here to specifically select the tag or field you desire.

@shakefu
Copy link

shakefu commented Aug 1, 2018

@gunnaraasen Yes, I'm aware, but I want to prevent that - because if someone accidentally writes a field with the same name as a tag, it breaks all the queries that don't specify ::tag or ::field (also, Grafana, which we used extensively with InfluxDB, doesn't support ::tag or ::field identifiers with its InfluxDB plugin without manually editing the query.)

@gunnaraasen
Copy link
Member

I recommend opening a feature request to add support for ::tag/::field in the Grafana query builder.

Preventing duplicate tag/field names would still require an unfeasible/inefficient check on every write even with the new index. We have no plans to add a setting to disallow these kinds of writes, but we are considering how this could be addressed with future API changes.

@shakefu
Copy link

shakefu commented Aug 3, 2018

@gunnaraasen I'm not aware of the internals of the new index - but if it's maintaining a list of the fields and tags, couldn't those be placed in a hash filter for an efficient check? Our issue goes beyond Grafana because we have tens of thousands of lines of code referencing field/tag names without qualifiers that we'd need to update.

Or perhaps a configurable whitelist/blacklist for fields or tags?

@daliborfilus
Copy link

I just stumbled upon this problem in one system I "inherited".
Some of our scripts insert tag keys and fields with the same key/name and I was in awe that it's even possible. (I always wondered why our queries returned "status,status_1" and the like.)

Anyway, I tried to fix it by removing the values. But with that fix, the queries weren't working correctly.
It was because that the tag values and field values existed with the same name, the inserts inserted now only to the tags (using line protocol), but the select query looks into fields first.

I'm sorry if it was already said somewhere, but... can I ask you why does it make more sense to filter by value first, when the same name exists as a tag? Searching via tags is faster, right?

select * from x where status = 'x';

takes the same time as

select * from x where status::field = 'x';

but

select * from x where status::tag = 'x';

returns instantly. That's how I discovered this behavior.

InfluxDB 1.8.9, x64, CentOS 7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants