Replace the tokenizer with a flex-based scanner #3846
Looking good. I assume we're now properly handling all the weird Flex crashes you were seeing? I also think it'd be neat if the Rakefile had a way to rebuild the lexer (and also to check that we're using the right version of Flex).
I've not tested this yet, but I wonder how well this new tokenizer will fare with non-ASCII. From the look of things, it should be 👌, but thought I'd ask to be sure. For context, an attempt to improve Linguist's support of non-ASCII in the Ruby implementation has been started in #3748.
Yeah; I constrained our use of features which turn out to be dangerous (LOOKING AT YOU, TRAILING CONTEXT), and everything works as expected now. ✨
+1, will add.
It'll do as well as it currently does, which is to say Not Hugely Well; non-ASCII stuff will get skipped. It wouldn't be too hard to make it grok things we're likely to see in UTF-8 text, though it'd be a lot harder to do this and only match word-characters (since we'd have to add actual Unicode understanding to our lexer at that stage).
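To make that trade-off concrete, here is a minimal Ruby sketch. It is not the flex scanner from this PR, and both patterns are illustrative assumptions. An ASCII-only word pattern splits tokens at non-ASCII characters, while a pattern built on Unicode properties keeps them whole, which is the "actual Unicode understanding" a lexer would need.

```ruby
# Illustrative only: neither pattern is taken from the PR's lexer.
ASCII_WORD   = /[A-Za-z0-9_]+/
UNICODE_WORD = /[\p{L}\p{N}_]+/u # /u pins the regexp to UTF-8

# "naïve café_1", written with \u escapes so the source itself stays ASCII.
text = "na\u00EFve caf\u00E9_1"

ascii_tokens   = text.scan(ASCII_WORD)
unicode_tokens = text.scan(UNICODE_WORD)

p ascii_tokens   # words are split wherever a non-ASCII character appears
p unicode_tokens # whole words survive
```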
* Don't read and split the entire file if we only ever use the first/last n lines
* Only consider the first 50KiB when using heuristics/classifying. This can save a *lot* of time; running a large number of regexes over 1MiB of text takes a while.
* Memoize File.size/read/stat; re-reading in a 500KiB file every time `data` is called adds up a lot.
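The three points above can be sketched roughly as follows. This is a hedged illustration, not Linguist's actual blob code: the class `CachedFile`, the constant `MAX_CLASSIFY_BYTES`, and the method names other than `data` are hypothetical.

```ruby
require "tempfile"

# Hypothetical stand-in for a file blob; not Linguist's implementation.
class CachedFile
  MAX_CLASSIFY_BYTES = 50 * 1024 # 50KiB cap, as described in the commit message

  attr_reader :read_count

  def initialize(path)
    @path = path
    @read_count = 0
  end

  # Memoized: the file is read from disk at most once.
  def data
    @data ||= begin
      @read_count += 1
      File.binread(@path)
    end
  end

  # Memoized size lookup.
  def size
    @size ||= File.size(@path)
  end

  # Only the leading slice is handed to heuristics/classification.
  def classifier_data
    @classifier_data ||= data.byteslice(0, MAX_CLASSIFY_BYTES)
  end

  # Take the first n lines lazily instead of splitting the whole file.
  def first_lines(n)
    data.each_line.first(n).map(&:chomp)
  end
end

tmp = Tempfile.new("cached_file_demo")
tmp.write("a\nb\nc\n" + "x" * (60 * 1024))
tmp.flush

file = CachedFile.new(tmp.path)
puts file.first_lines(2).inspect
puts file.classifier_data.bytesize
```

`each_line.first(n)` stops enumerating after `n` lines, so a large file is never fully split just to look at its head.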
@@ -289,6 +287,44 @@ def lines
end
end

def encoded_newlines_re
@encoded_newlines_re ||= Regexp.union(["\r\n", "\r", "\n"].
Does the `\R` extension not work here?
I also take it Ruby's regex engine doesn't have the equivalent of Perl's `/a` modifier?
I'm changing as little code as I can; this is just a refactor from:
`\R` also catches `[\v\f]`, which we definitely don't want.
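The difference is easy to check. A minimal sketch, assuming a plain ASCII sample string; the constant name `NEWLINES_RE` is hypothetical and this is not the truncated method from the diff:

```ruby
# Explicit alternation: only CRLF, CR, and LF count as line breaks.
# "\r\n" comes first so a CRLF pair is consumed as a single separator.
NEWLINES_RE = Regexp.union(["\r\n", "\r", "\n"])

text = "one\r\ntwo\rthree\nfour\vfive\fsix"

explicit = text.split(NEWLINES_RE)
with_r   = text.split(/\R/)

p explicit # vertical tab and form feed stay inside the last element
p with_r   # \R also breaks on \v and \f
```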
> I also take it Ruby's regex engine doesn't have the equivalent of Perl's `/a` modifier?
~$ ruby -e '//a'
-e:1: unknown regexp option - a
It doesn't, and more to the point, it wouldn't help for our use here, which isn't about Unicode-aware matching so much as avoiding terrible encoding exceptions rising from the deep. `/a` modifies the meaning of several sequences in the regular expression itself, rather than changing how a regular expression is applied to a given byte-sequence-tagged-with-an-encoding (i.e. a `String`), whatever the meaning of its contents.
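As a rough illustration of the "byte sequence tagged with an encoding" point, the sketch below applies the same regular expression to the same bytes under two different encoding tags. The sample bytes are an assumption chosen to be invalid UTF-8; this is not code from the PR.

```ruby
# Raw bytes: 0xE9 followed by "\n" is not a valid UTF-8 sequence.
bytes = "caf\xE9\nbar".b

as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_binary = bytes.dup.force_encoding(Encoding::BINARY)

# Same pattern, same bytes, different tag: the UTF-8 tagged string raises.
utf8_error = begin
  as_utf8.split(/\n/)
  nil
rescue ArgumentError => e
  e.message
end

# Tagged as binary, the bytes are matched without any validity check.
binary_lines = as_binary.split(/\n/)

puts utf8_error
p binary_lines
```

The pattern never changes here; only the tag on the string does, which is why a regexp modifier would not address the problem.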
Ah, I see. ;) Just thought to ask, since it's used very little in Perl (for good reasons). Thanks!
I'd like to merge this! Anyone feel like doing a final review?
Caveat pre-emptor: I have a copy of Dennis Ritchie's book but I'm far from being a C expert.
From what I do know this looks good to me, and the perf improvement is fantastic!!
@lildude Thank you! The responsibility is mine if this somehow goes belly-up.
Preliminary benchmarks put this in at a 12x speedup.
It doesn't produce identical results, but very near enough to. (Enough that all the tests should pass.)
/cc @vmg because he luuuuurves C