Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix duplicate column name bug #4353

Closed

Conversation

Jakdaw
Copy link
Contributor

@Jakdaw Jakdaw commented Nov 13, 2019

  • Bug Fix

Description

The existing code to ensure that we don't have duplicate column names has a bug if it picks a new column name that's the same as an existing column. This can be achieved with:

select 1 a, 2 a1, 3 a;

Changing if to while fixes this.

@@ -105,7 +105,7 @@ def fetch_columns(self, columns):

for col in columns:
column_name = col[0]
if column_name in column_names:
while column_name in column_names:
column_name = "{}{}".format(column_name, duplicates_counter)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this code will work. But it may generate strange column names.

Assume we have query result with columns a a1 a12 a - last column needs unique name. This code will generate a123 (a (used) -> adds '1' -> a1 (used) -> adds '2' -> a12 (used) -> adds '3' -> a123, while just a2 would be totally fine.

What if on each iteration you will use original columns name + counter instead of concatenating counter to result from previous iteration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I thought the situations where this might happen were pretty contrived so exactly what's chosen not too important; but wanted to fix it because the client code really relies upon the names being unique, and it's behaviour when they're not is just plain wrong!

Yes what you're suggesting sounds like an improvement, although you'd still have weirdness, like:

(a, a, a1) would map to (a, a1, a11)

... so we could perhaps alternatively, just drop the numbers altogether and say adding "_" characters until it's unique.

(a, a, a_) would then map to (a, a_, a__)

If I was going to be really pedantic I might want to limit column renames to columns that had a duplicate name before any of this logic was applied. So, for the previous example, have:

(a, a, a_) map to (a, a__, a_)

... then again I'm not sure whether any of these options are really much better; AND they might well mean that existing queries start to have different column names, and given all the Visualisation settings are based upon column names we're going to break people's Vis settings for existing queries by changing this behaviour. Which brings me back to where I started.... ;-)

I don't have a strong view - does anyone else?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no perfect solution I guess. The improvement I suggested in previous comment will generate shorter names for non-unique columns (e.g. in my example, second a column will get name a2 instead of a123). As about your case - it also could be solved: algorithm should just sort column names (descending) and then enumerate them: so (a, a, a1) would be checked in order (a1, a, a) and mapped to (a1, a2, a). WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @kravets-levko suggestion of sorting + using a single counter instead of appending (the end result will be less confusing to the user).

@guidopetri
Copy link
Contributor

@Jakdaw , thanks for the PR! We've updated a lot of things now that we're Community-driven so - if you're still interested in getting this merged - would you mind rebasing off master to re-run the CI, as well as updating merge conflicts?

We're trying to clean up our PR todo list, so if you're not interested, that's fine - we'll close the PR in about a week if we don't hear back. If you're interested in reopening the PR afterwards, we would also very much welcome that.

@guidopetri
Copy link
Contributor

Closing in favor of #4600

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants