-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
False positive consider-using-tuple
for x in set
, because frozen sets are faster
#4776
Comments
Hey thank you for opening the issue. Do you have a source for the performance comparison between the two ? |
>>> def in_set(a): return a in {"abc", "def", "ghi", "klm", "nop"}
...
>>> def in_tuple(a): return a in ("abc", "def", "ghi", "klm", "nop")
...
>>> from timeit import timeit
>>> timeit(lambda: in_set("foobar"))
0.08363517999532633
>>> timeit(lambda: in_tuple("foobar"))
0.1204142420028802 Using tuples (or lists for that matter) instead of sets require that comparisons are done sequentially until the member is found or the tuple. Also, using tuples or lists requires placing the expected-most-often-found elements first as they will be compared first, while in a set a hash comparison will find the matching element (or the absence of the key) in few operations. As is documented in Python source
Looking up a non-existing member is the worst case when using tuples or lists because all elements must be compared to the looked-up key before declaring it a miss, while sets are also optimized for this use-case. |
Note that a set lookup is amortized O(1) while tuple and list lookup is O(n). However, when n is small of course it makes no sense to compare costs this way, and this is not what I'm doing in the comment above. Python set objects (and particularly frozenset objects) are optimized for this particular use case including for small collections with very few objects. Also, I've seen many code fragments looking like: if s in {"abc", "def", "ghi"}:
do_something(s)
elif s in {"jkl", "mno", "por"}:
do_something_else(s) One must realize that |
consider-using-tuple
for
x in set
`, because frozen sets are faster
consider-using-tuple
for
x in set
`, because frozen sets are fasterconsider-using-tuple
for x in set
, because frozen sets are faster
As I have explained here before, the main reason for this change wasn't performance, but rather code style. That's also why I added it to the optional CodeStyleExtension. As it says in the documentation
The main goal is to improve code consistency, even if not always the most absolute most performant. --
-- -- from pympler import asizeof
t = ("aa", "bb", "cc", "dd")
s = {"aa", "bb", "cc", "dd"}
s2 = frozenset(t)
print(f"Tuple: {asizeof.asizeof(t)}")
print(f"Set: {asizeof.asizeof(s)}")
print(f"Frozenset: {asizeof.asizeof(s2)}")
--
Tuple: 296
Set: 440
Frozenset: 440 |
That's where our opinions differ I guess. I do not think it is ok to suggest bad practices to a user, especially a beginner, just for the sake of "code consistency". Someone using a set for a membership test is doing nothing wrong and is using the right tool, I would not understand why we would actively suggest using something inferior. |
As I explained, it isn't inferior. You only considered the performance aspect, which isn't the whole picture in this case. |
Also you write
This is easy to check: >>> import dis
>>> def f():
... if s in {"abc", "def", "ghi"}:
... do_something(s)
... elif s in {"jkl", "mno", "por"}:
... do_something_else(s)
...
>>> dis.dis(f)
2 0 LOAD_GLOBAL 0 (s)
2 LOAD_CONST 1 (frozenset({'def', 'ghi', 'abc'}))
4 CONTAINS_OP 0
6 POP_JUMP_IF_FALSE 18
3 8 LOAD_GLOBAL 1 (do_something)
10 LOAD_GLOBAL 0 (s)
12 CALL_FUNCTION 1
14 POP_TOP
16 JUMP_FORWARD 16 (to 34)
4 >> 18 LOAD_GLOBAL 0 (s)
20 LOAD_CONST 2 (frozenset({'mno', 'jkl', 'por'}))
22 CONTAINS_OP 0
24 POP_JUMP_IF_FALSE 34
5 26 LOAD_GLOBAL 2 (do_something_else)
28 LOAD_GLOBAL 0 (s)
30 CALL_FUNCTION 1
32 POP_TOP
>> 34 LOAD_CONST 0 (None)
36 RETURN_VALUE |
I didn't question that behavior. But consider the following, not everyone knows about it. |
This is wrong, the performance aspects bother me less than the bad practices ones: one should not actively suggest using sequential scanning for membership tests. This is CS 101. Even the official Python tutorial for set (https://docs.python.org/3/tutorial/datastructures.html) presents the membership test as "fast membership testing", i.e. the right tool for this. I'm not asking for the reverse suggestion to be added although I think this would be a good idea, I'm just asking not to suggest the user to change their code from set to tuple in the case of membership test. |
The fact that it is a If small sets were built at execution time each time a membership tests was done, I would totally agree with you that tuples would be better, but this is not the case. |
By the way, sorry if I sound passionate about the issue, but with my CS teacher hat on I would hate to see my students change their code from a static set membership test to a tuple one, and it would affect their grade negatively if they had done so without a good justification for using a sequentially scanning tool instead of a hashing one. |
Thank you @samueltardieu for the discussion about this topic. I took some time to think about it more and looked at some more code. As it turns out Python is better optimized than I was aware of (although I should probably have expected as much). That made me question if the whole import dis
def f(s):
if s in ["a", "b", "c"]:
return True
# ---
>>> dis.dis(f)
50 0 LOAD_FAST 0 (s)
2 LOAD_CONST 1 (('a', 'b', 'c'))
4 CONTAINS_OP 0
6 POP_JUMP_IF_FALSE 6 (to 12)
51 8 LOAD_CONST 2 (True)
10 RETURN_VALUE
50 >> 12 LOAD_CONST 0 (None)
14 RETURN_VALUE As you can see from the example above, even lists are internally converted to tuples 🤦🏻 Same is true for --
-- -- --
-- |
Indeed, Python prefers to build read-only versions ( Concerning lists vs tuples, as a fan of immutable data structures, I would say that using tuples instead of lists where possible is best because it encourages people at thinking about immutable structures; however, as a long-time Pythonista myself, I tend to use lists because they look nicer than tuples, especially when they have only one element as they don't require the extra comma, and as you mention I would keep the check warning about the use of in-place sets in iterations and suggest using tuples or lists, as I cannot think of any good reason for using a set there. Moreover, switching to lists or tuples will not only be more efficient but also give a consistent traversal order across Python interpreters and versions, while the set guarantees nothing in this domain. |
Just a quick note, if data structures get compiled to other, optimized ones (such as lists to tuples and sets to frozensets like they are shown to be here), but nothing documents such behavior and that it's going to stay, then relying on it would be relying on internal implementation/optimization detail which might or might not hold between Python versions and Python implementations (cpython, pypy, ...). Which isn't good. There's nothing wrong with recommending a semantically correct way to do things if there is one, and in that sense I think there's still value in this check, if the "correct" way can be agreed on, even if it would not bring any performance benefits. |
The way I understand, it sets are better performance wise for membership and worst for iterating, while list/tuple are equivalent. Let's not bother pylint's user for something that is equivalent and focus on the set vs list/tuple. How about we remove the current Multi choice vote here:
Planning wise we could remove |
I've seen literal sets used for iterating a bunch of times. IIRC all of them have been inadvertent, and some of them sources of actual bugs due to the unstable order, such as pre-commit/pre-commit#1117. So in my experience something flagging those is most welcome. Not sure which of the choices represent this wish of having some way to check it best, but I'm placing it on 🎉 above. Re Is the plan BTW to limit the check against sets for iteration for literal sets only, or all of them? There's a difference to me there; non-literal is more likely to be intentional/harmless, maybe that shouldn't cause a warning at all, dunno. |
Certainly an argument in favor of adding / keeping the check for iterations.
Maybe we should split this into two separate checks. One
I mentioned earlier that -- |
That make sense, I like that. |
Moving that issue to 2.11 as |
We realized in #4841 that some object not being hashable means that suggesting to use set is impossible for something like this: letters= ["a", "b"]
numbers = [1, 42]
if "c" in (letters, numbers,):
return What would be the preferred way for you going forward ?
|
I'd prefer not to add We have to consider it might not be obvious for some that sets require hashable items. Someone might start with a Furthermore, there is more maintenance required if we add some kind of heuristic. It won't be perfect so we would need to fix corner cases constantly. In the end we might even need a new check to recognize the error I described earlier. IMO it's just not worth it. -- |
This will be fixed by the extension |
Bug description
The change introduced in #4768 is unwarranted.
Considering the following code:
it suggests to use a tuple in both cases:
While the second suggestion is right (in-place sets should not be used for iterating over their content), the first one (added in #4768) is a very bad advice. In the case of a membership lookup a set is faster, compiled as a
frozenset
by Python, and the right tool for the job.I would suggest that this commit be reverted before the release as it encourages bad practices. A new style check could rather suggest to replace in-place lists and tuples in membership tests by sets when the data type is known to be hashable (as strings or numbers are for example).
Configuration
No response
Command used
Pylint output
Expected behavior
I would expect to see the second suggestion (which is sound), and not the first one (which is a bad advice).
Pylint version
OS / Environment
ArchLinux
Additional dependencies
No response
The text was updated successfully, but these errors were encountered: