
Check for duplicate dictionary keys #72

Merged: 1 commit merged into PyCQA:master on Jul 24, 2016

Conversation

@geokala (Contributor) commented Jun 11, 2016

No description provided.

@geokala force-pushed the dict-duplicate-keys-check branch 9 times, most recently from 223f86e to 0c2a6e6 on June 11, 2016 at 18:56
return tuple(self.convert_to_value(i) for i in item.elts)
elif isinstance(item, ast.Num):
return item.n
elif isinstance(item, ast.Name):
Member

I'm pretty sure this fails with instance properties and other attributes of a name.
Could you add it to the test suite, and discard names if I am correct about them?

@geokala (Contributor, Author), Jun 12, 2016

Just to make sure I am understanding correctly:
e.g.

class Test(object):
    pass
a = Test()
b = Test()
a.something = 'yes'
{
  a.something: 1,
  a.something: 1,
}

If your assumption is correct, are you expecting this to fail by not detecting a duplicate, or to fail with a traceback?
I'll give it a test assuming you meant the above for the moment (a good test to have anyway), and if I misunderstood just let me know and I'll deal with it either later today or tomorrow evening.

Apologies if I'm misunderstanding something trivial; this is the first time I've done much with the AST.

Member

I expect it to find duplicate 'a' even if different attributes of a are used

Contributor (Author)

It didn't seem to find either, as it was an ast.Attribute object. Those are handled now, however.
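(For anyone following the AST details: a quick interpreter check, separate from the patch, showing why a key such as a.something is not an ast.Name.)

>>> import ast
>>> tree = ast.parse("{a.something: 1, a.something: 2}", mode="eval")
>>> for key in tree.body.keys:
...     print(type(key).__name__)
...
Attribute
Attribute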

Contributor (Author)

(Note, as this still appears in its own thread on the PR: these are no longer handled; see the remaining comments.)

Member

(Thanks. I wish there were a way to force a PR thread to be marked as done. I find GitHub very confusing for in-depth code review with multiple incremental revisions; the result seems to be that committers tend to merge early and do fixups afterwards, creating churn and often bugs.)

@geokala force-pushed the dict-duplicate-keys-check branch from 0c2a6e6 to c01aecd on June 12, 2016 at 17:31
@geokala (Contributor, Author) commented Jun 12, 2016

I just added some extra logic for dealing with instance attributes, then realised that I'm also missing logic for subscript lookups used as keys (e.g. {aaa['something']: 1, aaa['something']: 1}), as well as tests for the new API elements (e.g. the VariableKey class and the convert_to_value method).

I'll add these over the next couple of days.

@asmeurer (Contributor) commented:

Attribute and getitem lookups are probably venturing too far into false-positive territory. It's possible for a.b and a[b] to return different things on each invocation.

@geokala (Contributor, Author) commented Jun 13, 2016

True, but is that likely to be done deliberately in more than a tiny fraction of cases?

The only case I can see them returning different things on each invocation is where the attribute is implemented as a getter that returns something different each time (and similarly for __getitem__).
Am I being blind in considering this extremely rare (and probably a code smell)?
(Bearing in mind that this shouldn't catch calls to methods such as a.something() being duplicated.)
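
(For illustration, a minimal hypothetical example of the kind of non-pure getter being discussed, where the "same" key expression evaluates differently on each access; the names here are invented.)

import itertools

class Flaky(object):
    _ids = itertools.count()

    @property
    def something(self):
        # A non-pure getter: a different value on every attribute access.
        return next(self._ids)

a = Flaky()
print({a.something: 1, a.something: 1})  # {0: 1, 1: 1} -- two distinct keys, so not really a duplicate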

@jayvdb (Member) commented Jun 13, 2016

A good sanity check is running pyflakes, with your new error, over cpython, where it should only occur in tests, and probably not even there.
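
(A sketch of one way to run such a check, assuming a cpython checkout in ./cpython and using pyflakes' api module; the exact entry points used here are my reading of the library, not part of this PR.)

import sys
from pyflakes.api import checkRecursive
from pyflakes.reporter import Reporter

# Recursively check every .py file under the checkout and report warnings,
# including the new duplicate-key one, on stdout.
warning_count = checkRecursive(['cpython/Lib'], Reporter(sys.stdout, sys.stderr))
print('%d warnings' % warning_count)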

@asmeurer (Contributor) commented:

Non-pure @property attributes aren't as rare as you would hope.

@geokala (Contributor, Author) commented Jun 14, 2016

Okay, I'll remove that logic from it shortly then, and get the PR updated. On the bright side, that simplifies the code!

@geokala force-pushed the dict-duplicate-keys-check branch from c01aecd to f164297 on June 14, 2016 at 18:56
@geokala (Contributor, Author) commented Jun 14, 2016

Updated.

Based on the existing tests, not testing VariableKey and UnhandledKey would be consistent (though perhaps I should add at least a couple of tests for VariableKey anyway, or should it be trusted that test_other will catch them?).

There also aren't currently tests for convert_to_value. I'm not sure whether there should be, or whether, once again, it should be expected that test_other will catch them.

@jayvdb (Member) commented Jun 15, 2016

Your current approach to testing is fine; the utility funcs and classes are getting exercised.

Could you add a test that doesn't raise any errors, with a few valid scenarios?
Also, funcs, classes and lambdas can be keys.
I don't see a test covering a tuple as a key.
Also worth adding a test that 1 and 1.0 are not dup keys, which looks good in your patch but could easily be broken in the future.

Probably also nice to special-case ... as a key that can appear multiple times.

And is there any reason to not extend this test to cover duplicate items in sets?

@sigmavirus24 (Member):

And is there any reason to not extend this test to cover duplicate items in sets?

You mean something like:

s = set([
   'foo',
   'bar',
   'foo',
])

I think that's less warning-worthy than

d = {
    'foo': 1,
    'bar': 2,
    'foo': 3,
}

Given that sets have the expected, well-defined behaviour there, while duplicate keys in a dict silently overwrite earlier values, which is almost never what the author intended.

@geokala (Contributor, Author) commented Jun 15, 2016

Agreed. If I initialise a set with duplicate values I expect it to discard the duplicates; if I do it with a dictionary then I get unexpected effects (until I track the issue down by hand).

@geokala (Contributor, Author) commented Jun 15, 2016

@jayvdb: I'll probably address your other comments this evening; I should have an update in then.

@sigmavirus24 (Member):

Probably also nice to special-case ... as a key that can appear multiple times.

Wait, why? That doesn't work, as ... is a singleton:

>>> {...: 'foo'}
{Ellipsis: 'foo'}
>>> {...: 'foo', ...: 'bar'}
{Ellipsis: 'bar'}

@geokala (Contributor, Author) commented Jun 15, 2016

Out of interest, what is the utility of allowing ... to appear as a dictionary key multiple times?

I can't think of a situation in which it'd be useful off the top of my head.

Happy to special-case it, just a bit perplexed as to why anyone would do it in the first place.

@sigmavirus24 (Member):

Out of interest, what is the utility of allowing ... to appear as a dictionary key multiple times?

I agree. I'm not sure I see the value. I'm open to an explanation for why it should be special-cased though.

@geokala force-pushed the dict-duplicate-keys-check branch from f164297 to 87429d5 on June 15, 2016 at 20:01
@geokala (Contributor, Author) commented Jun 15, 2016

Could you add a test that doesn't raise any errors, with a few valid scenarios?

I've added a few tests that don't raise errors, and a test to see if we pick up duplicate keys in a literal dictionary inside a lambda.

Also, funcs, classes and lambdas can be keys.

Functions and classes are ast.Names.
Lambdas, unless assigned to a variable (in which case duplicates are already caught), don't 'count' as duplicates; see:
{ lambda x: None: 1, lambda x: None: 1, }
This actually gives a dictionary with two items.

I don't see a test covering a tuple as a key.

https://github.com/pyflakes/pyflakes/pull/72/files#diff-d18f55ece21d21fb172322f7db700d6aR31

Also worth adding a test that 1 and 1.0 are not dup keys, which looks good in your patch but could easily be broken in the future.

Those are actually duplicates (tested on Python 2 and Python 3 with {1.0: 1, 1: 1}) and they are picked up correctly.
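
(A quick interpreter check of both points above, outside the patch itself.)

>>> # Two syntactically identical lambdas are distinct objects,
>>> # so they really do give a two-item dict.
>>> len({(lambda x: None): 1, (lambda x: None): 1})
2
>>> # 1 and 1.0 compare and hash equal, so they collapse to one key;
>>> # the first key object is kept and the last value wins.
>>> {1.0: 1, 1: 2}
{1.0: 2}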

@geokala force-pushed the dict-duplicate-keys-check branch from 87429d5 to a93d3ce on June 15, 2016 at 20:06
@jayvdb (Member) commented Jun 16, 2016

Wrt ..., it is a legal expression in Python 3; its use as a placeholder for code is accepted.
https://mail.python.org/pipermail/python-3000/2008-January/011793.html
Using ... is clear intent by the author (to do the wrong thing), but it is valid syntax and Python doesn't attach any meaning to it, so IMO pyflakes shouldn't report an error if it is used as a placeholder twice in the same dict:

a = {
    1: 1,
    2: 2,
    3: 3,
    ...: ...,
    'a': 'a',
    'b': 'b',
    'c': 'c',
    ...: ...,
}

There is only a very small chance we're helping anyone by making that an error. If someone is using ... in a dict, they shouldn't be expecting Python syntax checkers to assist them. It can't be a typo as far as I can see, as there is no valid syntax which is lexically similar to .... They intended to use it, their intent indicates the code is not production code, and it is valid to use it, so why error?

If pyflakes is going to error when ... is used in that dict, we should error whenever it appears in the dict (not just the first time), or possibly anywhere in code except in slices. But that is assuming a meaning for ..., and shouldn't be done by pyflakes IMO (though it would be very useful, similar to flake8_print and other checks that verify code is ready for promotion).

This is in contrast to the other, similar constant NotImplemented, which is not a placeholder; it is reasonable for it to appear in a dict in real production code, and only one such dict entry would be valid.

@geokala (Contributor, Author) commented Jul 16, 2016

@jayvdb: Sorry, I must've misunderstood. Previous behaviour (up until the change I just pushed) was to raise only one complaint per set of duplicated keys with differing values (e.g. { 1: 1, 1: 2 } would raise one complaint). I've modified it now so it should behave as you expect.

@lamby: I've also put in the change you and @sigmavirus24 suggested above (well, you suggested, with addendum, etc., etc.). It could actually be changed to == 1 (it can't ever be zero for the comparison it's doing), but I'm not sure if that gives any real improvement.

differing_values = True
break

if differing_values:
Another one for inverting the logic and using continue?

Member

The whole differing_values section can be reduced to a single if [any|all](...): continue, which reduces the complexity of the method by ~16%.
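
(A generic sketch of the suggested shape, with invented names rather than the actual checker code, assuming a per-key list of values.)

def report_repeated_keys(key_values, report):
    # key_values maps each literal key to the list of values it was given.
    for key, values in key_values.items():
        # Rather than setting a differing_values flag in an inner loop and
        # breaking out, collapse the check into one expression and skip the
        # key early when every value is the same.
        if all(value == values[0] for value in values[1:]):
            continue
        report(key)

report_repeated_keys({'a': [1, 2], 'b': [3, 3]}, print)  # prints: a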

@lamby commented Jul 16, 2016

but I'm not sure if that gives any real improvement.

Really? Reducing the indentation makes it far, far easier to follow, IMHO.

@jayvdb (Member) commented Jul 16, 2016

@geokala, thanks for including reporting of all the lines with duplicates, and adding the correct line numbers.

Could you please use %r for the literals? It will add quotes around the strings, which is quite necessary if the string contains a space, or worse, newline characters.

I'd also really like the reported literal to be the same as it appears in the source, rather than the key 0x0020 appearing in the report output as 32. If I see a literal in an error report, I expect to be able to grep the source file for that literal. The perfect UI IMO would emit ... "0x0020" (32) ..., as then it is grepable, and two different ways of writing the same number are shown to be the same number.
To do this, it looks like we need to use dict_node.keys[i].col_offset and parse for the end of the key (and give up if we somehow reach the corresponding dict_node.values[i].col_offset?). Or is there another way I'm not seeing? Maybe this should be a followup patch.
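
(For illustration, the difference %r makes to a report line; the message text here is invented, not the patch's actual wording.)

>>> key = 'two words\nsecond line'
>>> print('duplicate key %s in dict' % key)
duplicate key two words
second line in dict
>>> print('duplicate key %r in dict' % key)
duplicate key 'two words\nsecond line' in dict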

re complexity/indents/etc:

$ python -m mccabe pyflakes/checker.py | grep DICT
("925:1: 'Checker.DICT'", 12)

That makes it the fourth most complex method, and the second most complex node handler method, which is not representative of the actual complexity of the task. I've noted above one way to reduce that to 10, and I suspect one or two more generators/comprehensions could be added so that the complex parts can be loaded into the brain as a single chunk.

@geokala force-pushed the dict-duplicate-keys-check branch from 29263cb to 401b012 on July 18, 2016 at 20:16
@geokala (Contributor, Author) commented Jul 18, 2016

@lamby: Apologies, I was unclear; I meant that changing it to == 1 rather than <= 1, while serving the same purpose here (and functioning the same way currently), probably doesn't gain us anything (and may open up future bug possibilities if the logic is slightly changed). The original change, yes, no complaints. -smiles-

@jayvdb: Sorry, I must've missed that one; I recall intending to do it. Either way, it should be good now.
I think the reported-literal issue for hex (and similar) is going to be worthy of a followup patch, yes. As you suggest, it is somewhat non-trivial (unless someone knows a way).
I'll do another pass now to try to get the complexity below 10.

@geokala force-pushed the dict-duplicate-keys-check branch from 401b012 to 31c1bcf on July 18, 2016 at 20:27
@geokala (Contributor, Author) commented Jul 18, 2016

@jayvdb: Change made per your suggestion of any(x) for differing_values.
Complexity is now 10.

I'm not sure it can reasonably be reduced further (in a way that doesn't sacrifice readability), but that may be down to my skill level and/or perceptions.

@jayvdb (Member) commented Jul 19, 2016

Take a look at jayvdb@3b29b50 for some ideas on how to cut complexity to 5.

@sigmavirus24 (Member):

@jayvdb I don't know of a specific guideline for what the complexity of the code should be. If there is one, we should add it to Travis CI and document it for future contributors. At this point, playing complexity golf with @geokala seems rude. If you have the complexity that low in your own fork, feel free to submit it as a pull request after this one. I think @geokala has been more than patient with us here.

@sigmavirus24 (Member):

Since we're just golfing over complexity, I'm going to run this over some more code-bases today and if all seems good, I'm going to merge this.

If there are reviews other than complexity golfing, please do leave them. I think this pull request has lived for quite long enough. Other style nits should be submitted by people after merge.

@geokala (Contributor, Author) commented Jul 19, 2016

@sigmavirus24 Sounds fair. I think @jayvdb's changes were quite elegant (I should've guessed collections would have something to simplify what I was trying to do there), but if you're happy to run this over some code-bases and settle the complexity question later, that's cool.

Also, yes, I do think deciding on a complexity guideline and testing for it (though maybe with an easy and obvious way to add exceptions) might be a good idea. I'm fairly sure flake8 supports this natively (I think I actually turned it on at the start of a project in a previous workplace).

Either way, if there are any other issues, I'm happy to address them!

@jayvdb (Member) commented Jul 19, 2016

My review comments are not just style nits. They are also performance improvements.

@sigmavirus24 (Member):

@jayvdb there is one review comment here unrelated to complexity that @geokala has kindly addressed. Unless there are more comments that are not being displayed by GitHub, all your existing comments relate to complexity.

@sigmavirus24 (Member):

Also @geokala if you want to adopt @jayvdb's changes, go ahead. IMO, you don't have to keep playing complexity golf though. You were asked to make it no higher than 10 and did so.

@geokala (Contributor, Author) commented Jul 19, 2016

I'm happy to adopt them, though it'll probably be late tonight before I get to them.

@jayvdb (Member) commented Jul 19, 2016

Ian, I think you have misread my earlier comment about '10'. If you look briefly at my diff you will see it contains performance improvements, implementing the additional generators/comprehensions I indicated earlier would help get the complexity below 10, down to a reasonable level. The current version reimplements a Counter twice, using list scans. It is best practice when doing code reviews not to worry about that kind of review comment until the patch is functionally OK. I put a diff up because that was easier than trying to explain how to fix it, which would have been a longer and more aggravating process for everyone involved, as it involved rearranging the code significantly. But the code isn't mine, and I wouldn't want to submit a rewrite after this and have git blame attribute it to me. If geokala can put a little more polish on their patch, the merged code should be stable until we work on the outstanding problem of printing the same representation as was used in the source.

@jayvdb (Member) commented Jul 20, 2016

While looking at pyflakes output, it occurred to me that this feature is basically identical to the existing RedefinedWhileUnused, which emits "filename.py:875: redefinition of unused 'xxx' from line 863", except that it is a prior dict key's value that is unused rather than a variable's value.

If there are any complaints after this is released that this is another case of pyflakes violating its contract with users and producing an error on valid code, RedefinedWhileUnused is a simple way to demonstrate that there is precedent for it.

Also, if we wanted to emit only one error line for a pair of repeated keys, using the same suffix as RedefinedWhileUnused, ... from line x, referring to the prior and different use of the key, would be the way to do it. Note: I'm not asking for, or advocating for, it to be modified; I'm just saying that would be reasonable IMO.
The current approach of emitting two messages for a pair of repeated keys is a bit verbose, but it is my preference as it works better with flake8 --show-source, since both lines with the different values are shown.

@geokala force-pushed the dict-duplicate-keys-check branch from 31c1bcf to aa39997 on July 20, 2016 at 19:38
@geokala (Contributor, Author) commented Jul 20, 2016

Hmm... collections has no Counter in 2.6... @jayvdb, do we have a good alternative other than using a backport?

@jayvdb (Member) commented Jul 21, 2016

Sure. We only need a dict with incrementing values, as we don't need any of the fancy Counter features.
Best to build it as a fallback for when the real Counter doesn't exist, but it could also be a normal (default)dict whose values are incremented in a for loop.
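
(A minimal sketch of the fallback idea, assuming only plain counting is needed; this is not the code that was merged.)

try:
    from collections import Counter
except ImportError:
    # collections.Counter only arrived in Python 2.7, so fall back to a
    # plain dict that simply increments a count per key.
    def Counter(iterable):
        counts = {}
        for item in iterable:
            counts[item] = counts.get(item, 0) + 1
        return counts

print(Counter(['a', 'b', 'a']))  # {'a': 2, 'b': 1} (Counter({'a': 2, 'b': 1}) with the real class)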

@geokala force-pushed the dict-duplicate-keys-check branch from aa39997 to fcf4c6b on July 24, 2016 at 14:08
@geokala (Contributor, Author) commented Jul 24, 2016

@jayvdb: Okay, I've put in a counter method and just used it, rather than adding extra complexity to decide whether to use it.

@sigmavirus24: copying you in on this in case you want to give it another look and/or you're both happy to merge?

@jayvdb merged commit 152ca18 into PyCQA:master on Jul 24, 2016
@jayvdb (Member) commented Jul 24, 2016

Nice work. Thank you so much for sticking with it.

@geokala (Contributor, Author) commented Jul 25, 2016

No worries; thanks to all of you for your comments!
