Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(types): add some types based on MonkeyType #1152

Closed
wants to merge 3 commits into from

Conversation

henryiii
Copy link
Contributor

Typing is a bit too sparse to run mypyc, so I've tried to fill in a couple of modules (lark.py and utils.py) with the help of monkeytype. This is based on what was seen running the tests more than what things should be, so input and cleanup would be helpful!

Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>
lark/parsers/lalr_parser.py Outdated Show resolved Hide resolved
return isinstance(value, self.types_to_memoize)

def serialize(self):
def serialize(self) -> Dict[int, Any]: # type: ignore[override]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This time I just ignored the mismatch.

@@ -134,7 +151,8 @@ def get_regexp_width(expr):
raise ImportError('`regex` module must be installed in order to use Unicode categories.', expr)
regexp_final = expr
try:
return [int(x) for x in sre_parse.parse(regexp_final).getwidth()]
# TODO: bug? Returns an int?
return [int(x) for x in sre_parse.parse(regexp_final).getwidth()] # type: ignore[attr-defined]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sre_parse.parse(regexp_final).getwidth() returns an int, which you can't iterate over. Not sure what was intended here.

Copy link
Member

@erezsh erezsh May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getwidth() returns a tuple of (min_width, max_width)

>>> sre_parse.parse('a+').getwidth()
(1, 4294967295)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's quite an obscure feature. Most re libraries in other languages don't even have it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in typeshed now, but missed today's mypy 0.960 release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, can't resolve until I fix the comment (tomorrow).

lark/utils.py Outdated Show resolved Hide resolved
Comment on lines -182 to +204
return [x for x in l if not (x in dedup or dedup.add(x))]
# This returns None, but that's expected
return [x for x in l if not (x in dedup or dedup.add(x))] # type: ignore[func-returns-value]
# 2x faster (ordered in PyPy and CPython 3.6+, gaurenteed to be ordered in Python 3.7+)
# return list(dict.fromkeys(l))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This surprises mypy, since it's using the None return. But I think using list(dict.fromkeys(l)) is simpler and faster (2x when I was trying it on a million ints).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't write this function, but looks like for someone else it was the opposite. Don't know why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this was before 3.6+ was required, like during the 2.7 days, then this wouldn't work. And OrderedDict is 2x slower (OrderedDict and the current implementation are nearly identical for me). I'm not testing on the actual data, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test it on heavily duplicated lists?

It's possible that fromkeys() keeps overwriting existing values, making it slower in that case.

(Just guessing, no idea if it's true)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 1,000,000 items with 100 unique values.

Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>
Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>
erezsh added a commit that referenced this pull request Sep 15, 2022
@erezsh
Copy link
Member

erezsh commented Sep 15, 2022

@henryiii I'm sorry it took me so long to review this PR! I will try to be more prompt in the future.

I have two issues with this PR:

  • pickle.dump / load do not limit the input to BytesIO, so I don't think that we should either.
  • utils.py should not be aware of any Lark-specific types, it's a violation of concerns.
    • If we need more specificity for functions like bfs() or classify(), it should be done through generics. e.g. bfs( Sequence[T] ) -> Iterator[T]
    • If serialize/deserialize can't be expressed effectively using generics (though I think they should?), it's better to move them to a separate module like serialize.py

I created a new PR where I revert only these changes. Let me know if you agree with it - #1191

@henryiii
Copy link
Contributor Author

That's fine, I'm assuming this will be an iterative process, and that's fine. :)

@henryiii henryiii closed this Sep 16, 2022
erezsh added a commit that referenced this pull request Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants