Add scanning #1429
base: master
Conversation
Awesome! I only gave it a quick look so far, but looks pretty good.
I left a few comments. I'll take a deeper look later, and I might come up with more.
Btw, can we drop support for the now completely unsupported versions 3.6 and 3.7? https://devguide.python.org/versions/
Yeah, I think we can. Users of <=3.7 will just stay stuck on this version of Lark, which is fine. But I think it should be in a separate PR.
Overall, looks really good!
There is one design idea that I have; I hope you'll keep an open mind. What if we change the definition of text to `text: str | TextSlice`, where TextSlice has 3 fields: text, start, end?
That would follow the principle of keeping related data together, and also make all the None checks a little less awkward (I think).
And as a bonus, that way we don't have to add new parameters to every function on the way.
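(A minimal sketch of what such a TextSlice could look like, assuming a plain frozen dataclass is enough; the field names follow the comment above, and the `as_slice` helper is purely illustrative.)

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class TextSlice:
    """A view of `text` restricted to the half-open range [start, end)."""
    text: str
    start: int
    end: int

def as_slice(text: Union[str, TextSlice]) -> TextSlice:
    # Hypothetical helper: normalize a plain string to a full-range slice,
    # so downstream code never has to special-case None positions.
    if isinstance(text, TextSlice):
        return text
    return TextSlice(text, 0, len(text))
```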
lark/lexer.py (outdated):

```python
        if m:
            return m.group(0), m.lastgroup

    def search(self, text, start_pos, end_pos):
```
I propose a different way to write this function:
```python
def search(self, text, start_pos, end_pos):
    results = list(filter(None, [
        mre.search(text, start_pos, end_pos)
        for mre in self._mres
    ]))
    if not results:
        return None
    best = min(results, key=lambda m: m.start())
    return (best.group(0), best.lastgroup), best.start()
```
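(For what it's worth: `filter(None, ...)` drops the patterns that find no match in the span, and `min(..., key=lambda m: m.start())` keeps the match that begins earliest, so the function returns the leftmost match across all of the compiled patterns.)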
```python
# We don't want to check early since this can be expensive.
valid_end = []
ip = self.parse_interactive(text, start=chosen_start, start_pos=found.start_pos, end_pos=end_pos)
tokens = ip.lexer_thread.lex(ip.parser_state)
```
Why not use `ip.iter_parse()`?
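(A rough sketch of that suggestion, assuming the surrounding scan loop only needs the tokens as the parser consumes them; `chosen_start`, `found` and `end_pos` come from the code under review and are not defined here.)

```python
ip = self.parse_interactive(text, start=chosen_start,
                            start_pos=found.start_pos, end_pos=end_pos)
# iter_parse() feeds each token to the parser and yields it back, so the
# explicit ip.lexer_thread.lex(ip.parser_state) call would no longer be needed.
for token in ip.iter_parse():
    ...
```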
I had in mind that it doesn't work, gotta check.
Yep, sounds like a good idea. I think in a later PR it might also make sense to go a step further and make …
I'm not sure what exactly you have in mind, but overall it sounds like a positive thing. Are you referring to the text also accepting bytes, or something else?
Specifically, the thing I had in mind is some kind of externally generated token stream with a custom lexer.
Hmm.. I'm not sure I get it.
Imagine using something like stdlib's …
I am going to bed for now, but an observation I am already making is that using …
Sounds like that might be a good idea.
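(For context, a rough sketch of what an externally generated token stream could look like when fed through lark's custom-lexer hook, modeled on the custom-lexer example in lark's documentation; the grammar, the terminal names and the list of pre-made tokens below are all made up for illustration.)

```python
from lark import Lark
from lark.lexer import Lexer, Token

class ExternalStreamLexer(Lexer):
    """Hands an externally produced token stream straight to the parser."""
    def __init__(self, lexer_conf):
        pass  # the terminals declared in the grammar are not inspected here

    def lex(self, data):
        # `data` is whatever gets passed to parser.parse(); here we assume it
        # is already an iterable of (terminal_name, value) pairs built elsewhere.
        for type_, value in data:
            yield Token(type_, value)

grammar = r"""
start: NAME NUMBER+
%declare NAME NUMBER
"""

parser = Lark(grammar, parser='lalr', lexer=ExternalStreamLexer)
tree = parser.parse([("NAME", "answer"), ("NUMBER", "42")])
```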
@MegaIng Do you have plans to submit the TextSlice PR in the near future? If not, maybe I'll take a crack at it.
Sorry, no, not next-few-days-soon. Forgot about this. It was a bit more work than I thought (since it also needed to touch a chunk of earley code). I am not currently on my main PC, so I can't even provide the work-in-progress, but I would be able to in ~2 days.
Sure, I'll wait until you're back to your PC. I'm not in a rush, I'm just in a mood to work on something, and might as well do this. The scan feature is pretty cool, so I'll be happy to unblock it if I can.
If you want to work on something, maybe you can come up with a solution to the issue of multi-path imports, as seen in this SO question? The problem is that the same terminal gets imported with different names and then of course collides with "itself". I don't really know if there is a good solution to this, maybe you can come up with something.
I gave it some thought, and it's a tricky problem. I wrote my thoughts here: #1448
@erezsh Pushed a version of …
An implementation of `Lark.scan`. Also adds `start_pos` and `end_pos` to `Lark.parse`, `Lark.parse_interactive` and `Lark.lex`.

TODO:

- `start_pos` and `end_pos` mirroring the behavior of stdlib `re` with regard to look-behind and look-ahead.

But I do think the core logic is pretty stable and I would like a review of that already @erezsh.

Future work:

- Use `mmap` to not have to load the text into memory at all (also involves checking up on the byte parsing implementation).
- Looking at the stdlib `tokenize` module, which would have a few benefits especially with regard to the new f-string syntax, and how well that would play with this feature.

This PR is based on #1428, so merging it first would be better.
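(To illustrate the look-behind/look-ahead point in the TODO above: stdlib `re`'s `pos` argument is not the same as slicing the string, while `endpos` effectively truncates it. A small demonstration, independent of lark:)

```python
import re

# pos does not behave like slicing: a look-behind can still see the
# character just before the start position.
print(re.compile(r"(?<=a)b").search("ab", 1))    # matches 'b'
print(re.search(r"(?<=a)b", "ab"[1:]))           # None

# ^ anchors to the real start of the string, not to pos.
print(re.compile(r"^b").search("ab", 1))         # None
print(re.search(r"^b", "ab"[1:]))                # matches 'b'

# endpos, by contrast, truncates: a look-ahead cannot see past it.
print(re.compile(r"a(?=b)").search("ab", 0, 1))  # None
print(re.compile(r"a(?=b)").search("ab"))        # matches 'a'
```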