-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix duplicate column name bug #4353
Conversation
@@ -105,7 +105,7 @@ def fetch_columns(self, columns): | |||
|
|||
for col in columns: | |||
column_name = col[0] | |||
if column_name in column_names: | |||
while column_name in column_names: | |||
column_name = "{}{}".format(column_name, duplicates_counter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this code will work. But it may generate strange column names.
Assume we have query result with columns a a1 a12 a
- last column needs unique name. This code will generate a123
(a (used) -> adds '1' -> a1 (used) -> adds '2' -> a12 (used) -> adds '3' -> a123
, while just a2
would be totally fine.
What if on each iteration you will use original columns name + counter instead of concatenating counter to result from previous iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I thought the situations where this might happen were pretty contrived so exactly what's chosen not too important; but wanted to fix it because the client code really relies upon the names being unique, and it's behaviour when they're not is just plain wrong!
Yes what you're suggesting sounds like an improvement, although you'd still have weirdness, like:
(a, a, a1) would map to (a, a1, a11)
... so we could perhaps alternatively, just drop the numbers altogether and say adding "_" characters until it's unique.
(a, a, a_) would then map to (a, a_, a__)
If I was going to be really pedantic I might want to limit column renames to columns that had a duplicate name before any of this logic was applied. So, for the previous example, have:
(a, a, a_) map to (a, a__, a_)
... then again I'm not sure whether any of these options are really much better; AND they might well mean that existing queries start to have different column names, and given all the Visualisation settings are based upon column names we're going to break people's Vis settings for existing queries by changing this behaviour. Which brings me back to where I started.... ;-)
I don't have a strong view - does anyone else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no perfect solution I guess. The improvement I suggested in previous comment will generate shorter names for non-unique columns (e.g. in my example, second a
column will get name a2
instead of a123
). As about your case - it also could be solved: algorithm should just sort column names (descending) and then enumerate them: so (a, a, a1) would be checked in order (a1, a, a) and mapped to (a1, a2, a). WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @kravets-levko suggestion of sorting + using a single counter instead of appending (the end result will be less confusing to the user).
@Jakdaw , thanks for the PR! We've updated a lot of things now that we're Community-driven so - if you're still interested in getting this merged - would you mind rebasing off master to re-run the CI, as well as updating merge conflicts? We're trying to clean up our PR todo list, so if you're not interested, that's fine - we'll close the PR in about a week if we don't hear back. If you're interested in reopening the PR afterwards, we would also very much welcome that. |
Closing in favor of #4600 |
Description
The existing code to ensure that we don't have duplicate column names has a bug if it picks a new column name that's the same as an existing column. This can be achieved with:
select 1 a, 2 a1, 3 a;
Changing
if
towhile
fixes this.