-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
should not allow field and tag with same key #2615
Comments
I need more context on how we want to fix this. There are several challenges for this depending on what the fix is. First, let's examine the If we check each point being written for duplicate tag/fields that will be fairly easy. While adding a little overhead, it should be fairly quick. Use of benchmarking could easily tell us what the trade off is. However, it's important to understand that enforcing the write endpoint BEYOND the single point write is VERY costly, as you could have this scenario:
Because we can write any tag/field name on any insert, we would have to look up all the meta data for that set of keys/fields before accepting the write. Now, in the case where we are in the same shard (time index), this is less overhead, but in the case of truly making this correct, it would need to check EVERY shard that the measurement exists in across the cluster. For the query endpoint, we have the same problem. To know if you have a duplicate field/tag value prior to issuing the query, we would have to check every shard where the time index of your query will hit. To be truly accurate, you would have to check every shard, otherwise sometimes the query may work, and sometimes not, as it may not run into a collision in one time range, but could in another. So the only feasible fix imo is on the single point write, but beyond that the cost seems way to high. |
Agreed, this isn't really fixable for the reasons you outlined @corylanou -- there is no central store of tag and field metadata in our system. |
@corylanou agree, just disallow it if they're trying to do that in the same point. |
Fixing it only for single point duplicates does very little to address the problem. It is still very easy for users to render their database unusable. If we can't prevent duplicate field and tag names, and that does seem to be a very hard challenge, then we must fix one or more of the following: implement OR make it possible to query tag and field of the same name - #4630 So, either we make it possible to rename a tag or field, so that we can undo a bad situation, or we make it possible to explicitly declare in the query language which column to use when querying the duplicate names, so we can work around a bad situation. Otherwise it is possible to make a single write to the database that causes huge chunks of data to become un-queryable forever, with no workaround or mitigation. That is not acceptable behavior for a database, regardless of the performance or architectural concerns involved. |
There is a third option -- allow users to specify up front what the schema is. |
I vote for what @beckettsean said about supporting |
rename isn't going to be tractable because those names existing in the underlying TSM files. It would necessitate rewriting every file across the entire database that has a reference to that name. We're unlikely to support renaming features for a long time. Using INTO queries they could potentially query the old data and write into a new series, but that could be tricky too. |
@pauldix I understand that none of the solutions are easy, but we have to do something that either prevents the issue from happening or allows for recovery after the fact. Otherwise it's a one way ticket to data loss. I like @gunnaraasen's proposal in #4823, and that seems more tractable than either preventing duplicates or renaming them. |
@beckettsean yeah, I think that's the way to go. I'm going to close this out in favor of giving the user explicit controls at query time. |
I should be able to modify the query engine to handle queries across fields and tags so that it looks correct, but the only requirement would be to prevent a single point from containing a duplicate. I can make this work:
I don't think I can make this work:
Should I reopen this ticket to prevent points from doing the latter? |
Since you can't prevent it in all cases, I think the conclusion above is still the correct one, give them better query controls. It's a very poor design choice for the end user to have both tags and fields with the same name imo, but if we give them a way to query it then it's up to them. |
@corylanou yep agree. As long as it's queryable later, should be fine. Give users a length of rope and a knife to cut it ;) |
Is there any way to prevent this (perhaps a database configuration option) now on Influx 1.5.x or 1.6.x? We just had an issue with this that caused several hours of data loss. Edit: The SHOW FIELD KEYS and SHOW TAG KEYS queries seem to indicate that there are now central metadata stores for the data points which could be used. |
@shakefu InfluxDB will accept overlapping tag and field names for writes so this issue will not cause data loss. The tag and field name overlap only affects queries. You'll need to use the |
@gunnaraasen Yes, I'm aware, but I want to prevent that - because if someone accidentally writes a field with the same name as a tag, it breaks all the queries that don't specify ::tag or ::field (also, Grafana, which we used extensively with InfluxDB, doesn't support ::tag or ::field identifiers with its InfluxDB plugin without manually editing the query.) |
I recommend opening a feature request to add support for Preventing duplicate tag/field names would still require an unfeasible/inefficient check on every write even with the new index. We have no plans to add a setting to disallow these kinds of writes, but we are considering how this could be addressed with future API changes. |
@gunnaraasen I'm not aware of the internals of the new index - but if it's maintaining a list of the fields and tags, couldn't those be placed in a hash filter for an efficient check? Our issue goes beyond Grafana because we have tens of thousands of lines of code referencing field/tag names without qualifiers that we'd need to update. Or perhaps a configurable whitelist/blacklist for fields or tags? |
I just stumbled upon this problem in one system I "inherited". Anyway, I tried to fix it by removing the values. But with that fix, the queries weren't working correctly. I'm sorry if it was already said somewhere, but... can I ask you why does it make more sense to filter by value first, when the same name exists as a tag? Searching via tags is faster, right? select * from x where status = 'x'; takes the same time as select * from x where status::field = 'x'; but select * from x where status::tag = 'x'; returns instantly. That's how I discovered this behavior. InfluxDB 1.8.9, x64, CentOS 7 |
We should raise an error if a field key has the same name as a tag key for a given measurement. Otherwise the behavior is unpredictable.
Inspiried by https://groups.google.com/d/msgid/influxdb/feca028b-7c62-439c-9d17-97b58edcee92%40googlegroups.com
The text was updated successfully, but these errors were encountered: