Regular expressions #161
Regex interning could be done by reusing interned strings and mapping these to their regex objects. This way re-using the same regex literal would result in only one heap allocation. Since all of this happens when parsing bytecode, it won't have any runtime overhead (once all modules are loaded).
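If the engine were Rust's regex crate, an intern pool along these lines could work; a minimal sketch (the `RegexPool` type and the `string_id` parameter are hypothetical, not existing Inko VM code):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use regex::Regex;

// Hypothetical pool mapping interned-string IDs to compiled regexes, so
// repeated uses of the same regex literal share a single compiled object.
struct RegexPool {
    by_source: HashMap<usize, Arc<Regex>>,
}

impl RegexPool {
    fn new() -> Self {
        RegexPool { by_source: HashMap::new() }
    }

    // `string_id` stands in for whatever identifier the VM uses for an
    // interned string; compilation happens at most once per literal.
    fn intern(&mut self, string_id: usize, source: &str) -> Arc<Regex> {
        self.by_source
            .entry(string_id)
            .or_insert_with(|| Arc::new(Regex::new(source).unwrap()))
            .clone()
    }
}
```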
Rust's regex crate would be the most likely underlying engine used for this. I in particular like its support for limiting the size of compiled regular expressions, which would be an interesting feature to expose to the Inko runtime. The VM instructions we'd need (that I can think of) would cover at least compiling a regular expression and finding matches.
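For reference, the regex crate exposes the size limit through `RegexBuilder::size_limit`, which caps the heap size of the compiled program and fails compilation when it would be exceeded:

```rust
use regex::RegexBuilder;

fn main() {
    // Refuse to compile any pattern whose compiled form exceeds ~1 MiB.
    let result = RegexBuilder::new(r"(foo|bar|baz){1,100}")
        .size_limit(1 << 20)
        .build();

    match result {
        Ok(regex) => println!("compiled: {}", regex.as_str()),
        Err(error) => println!("rejected: {}", error),
    }
}
```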
Per rust-lang/regex#362 it seems the
One option might be to ditch supporting regular expressions in the standard library, instead deferring this to a C library used through the FFI API. Alternatively, we could perhaps support Rosie.
Rosie appears to offload most of its work to Lua, with the C library being a wrapper around the Lua embedding API (if I'm reading everything properly). This may result in undesirable overhead when using Rosie from Inko.
Rosie seems like an unlikely candidate, but the underlying idea (reusable and testable PEG grammars) is very interesting. There are currently two ideas I'm toying with:
Syntax grammars

Syntax based grammars would be more verbose compared to regular expressions, but also easier to read. Let's say we want to parse an ISO 8601 date. Using regular expressions, we may end up with something like the following:
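For illustration, such a pattern could look as follows (sketched with Rust's `regex` crate; the exact pattern is an example, not taken from Inko):

```rust
use regex::Regex;

fn main() {
    // An illustrative pattern for a YYYY-MM-DD date, with capture groups
    // for the year, month, and day.
    let date = Regex::new(r"^(\d{4})-(\d{2})-(\d{2})$").unwrap();

    assert!(date.is_match("2018-02-15"));
}
```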
The syntax based grammar would instead define a set of named rules, one per component of the date, combined into a single rule for the date as a whole; see the sketch below.
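To give a rough idea of the shape, here is a sketch using Rust's `peg` crate; the rule names are invented for illustration and this is not a proposal for Inko's syntax. Note that `peg`, like most Rust PEG crates, expands the grammar at compile time, which is the limitation mentioned below:

```rust
peg::parser! {
    grammar iso8601() for str {
        // A date is year "-" month "-" day, e.g. "2018-02-15".
        pub rule date() -> (u16, u8, u8)
            = y:$(['0'..='9']*<4>) "-" m:$(['0'..='9']*<2>) "-" d:$(['0'..='9']*<2>)
            { (y.parse().unwrap(), m.parse().unwrap(), d.parse().unwrap()) }
    }
}

fn main() {
    assert_eq!(iso8601::date("2018-02-15").unwrap(), (2018, 2, 15));
}
```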
A big downside of syntax grammars is the work necessary to implement them. This would require the compiler to parse and verify the syntax, and generate the necessary VM instructions or Inko code to parse the input efficiently. Using an existing Rust crate might allow us to offload all of this to said crate at runtime, but thus far I have been unable to find a well maintained crate for parsing PEG at run time; most only support parsing them at compile time.

Parsing combinators

Parsing combinators are just methods that take some kind of input, and return a parser for that input. To parse the same ISO 8601 date mentioned above, you may write something like the sketch below:
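A minimal hand-rolled sketch in Rust, assuming invented `digits` and `literal` combinators (Inko's actual API would differ):

```rust
// A parser is a function from input to an optional (value, remaining input).
type Parser<'a, T> = Box<dyn Fn(&'a str) -> Option<(T, &'a str)> + 'a>;

// Parses exactly `n` ASCII digits, returning them as a string slice.
fn digits<'a>(n: usize) -> Parser<'a, &'a str> {
    Box::new(move |input| {
        let bytes = input.as_bytes();

        if bytes.len() >= n && bytes[..n].iter().all(|b| b.is_ascii_digit()) {
            Some((&input[..n], &input[n..]))
        } else {
            None
        }
    })
}

// Parses a literal string.
fn literal<'a>(expected: &'a str) -> Parser<'a, &'a str> {
    Box::new(move |input| input.strip_prefix(expected).map(|rest| (expected, rest)))
}

fn main() {
    let year = digits(4);
    let dash = literal("-");
    let month = digits(2);
    let day = digits(2);

    // Chain the parsers by hand; real combinators would hide this plumbing
    // behind methods such as `and_then` or `then`.
    let parsed = year("2018-02-15")
        .and_then(|(y, rest)| dash(rest).map(|(_, rest)| (y, rest)))
        .and_then(|(y, rest)| month(rest).map(|(m, rest)| ((y, m), rest)))
        .and_then(|((y, m), rest)| dash(rest).map(|(_, rest)| ((y, m), rest)))
        .and_then(|((y, m), rest)| day(rest).map(|(d, rest)| ((y, m, d), rest)));

    assert_eq!(parsed, Some((("2018", "02", "15"), "")));
}
```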
While this approach is much more verbose, it has several benefits:
There are also a few downsides to parser combinators:
Also worth adding: regular expressions are great for very simple patterns. PEG grammars and parsing combinators (regardless of the parsing algorithm used) sacrifice compactness for readability and maintainability. This means that simple PEGs/combinators might be more verbose compared to the equivalent regular expressions, but they (in my opinion) tend to be equally readable when used for very complex parsers. In other words: the "startup cost" of PEGs/combinators is a bit higher, but it stays more or less the same. Regular expressions on the other hand start out fairly straightforward, but become more painful to work with very quickly.
There are three questions to think about here:
In all cases the answer comes down to "That depends", but I used GitLab CE to try and get some more data. In CE, there are a total of 556 regular expressions in application code (excluding configuration and tests). I used the following script for this:

```ruby
# frozen_string_literal: true

# gem install parallel
require 'ripper'
require 'parallel'

patterns =
  Parallel
    .map(Dir['./{app,lib}/**/*.rb'], in_processes: 4) do |path|
      found = []
      buffer = ''
      in_regex = false

      # Lex each file and collect the source of every regexp literal.
      Ripper.lex(File.read(path)).each do |token|
        if token[1] == :on_regexp_beg
          in_regex = true
          # Start collecting a new pattern.
          buffer = ''
        end

        if in_regex
          buffer += token[2]
        end

        if token[1] == :on_regexp_end
          in_regex = false
          found << buffer.gsub("\n", '\n')
        end
      end

      found
    end
    .inject([]) { |acc, curr| acc.concat(curr) }

# Bucket the patterns by length: tens below 100 characters, hundreds above.
buckets = Hash.new(0)

patterns.each do |pattern|
  bucket =
    if pattern.length >= 100
      pattern.length.round(-2)
    else
      pattern.length.round(-1)
    end

  buckets[bucket] += 1
end

puts 'Bucket Amount'

buckets.sort.each do |(bucket, value)|
  puts "#{bucket.to_s.ljust(6, ' ')} #{value.to_s.ljust(6, ' ')} #{'▮' * value}"
end
```

Using this on CE, the output is a histogram of the patterns bucketed by length.
Going through the individual regular expressions, the complexity varies greatly. Some are simple, while others are far more involved, such as this one:

````ruby
%r{
  (?<code>
    # Code blocks:
    # ```
    # Anything, including `/cmd arg` which are ignored by this filter
    # ```
    ^```
    .+?
    \n```$
  )
  |
  (?<html>
    # HTML block:
    # <tag>
    # Anything, including `/cmd arg` which are ignored by this filter
    # </tag>
    ^<[^>]+?>\n
    .+?
    \n<\/[^>]+?>$
  )
  |
  (?<html>
    # Quote block:
    # >>>
    # Anything, including `/cmd arg` which are ignored by this filter
    # >>>
    ^>>>
    .+?
    \n>>>$
  )
  |
  (?:
    # Command not in a blockquote, blockcode, or HTML tag:
    # /close
    ^\/
    (?<cmd>#{Regexp.new(Regexp.union(names).source, Regexp::IGNORECASE)})
    (?:
      [ ]
      (?<arg>[^\n]*)
    )?
    (?:\n|$)
  )
}mix
````

For many of the more complex ones it seems like a proper parser would have been much more suitable. For simple regular expressions I suspect a parser might be overkill, even when using parsing combinators. Lua's patterns system might be more suitable, but I think it somewhat suffers from the same issues as regular expressions: it works for small/simple patterns, but likely doesn't scale to more complex ones. To illustrate, here are some simple patterns from the CE codebase:

```
//
/./
/^2/
/\n/
/\s/
/\s/
/-+/
/.+/
/.+/
/.+/
/\t/
/\s/
/../
/=~/
/==/
/!=/
/\s+/
/\d+/
/^\?/
/\s+/
```

Most of these are still reasonably easy to understand. The following patterns however already get more complicated:

```
%r{\/.*\/}
/\A(\d+)-/
/[\w\.-]+/
/\s+/
/\s+/
/^readme/i
/.*\.map$/
/#(\d+)\z/
%r{\Adoc/}
/section_/
/\A[^,]+\z/
/\A[^@]+\z/
%r{^/\w.*$}
/[\w%.+-]+/
/\s/
/[<>+]/
/key-(\d+)/
/[^\s\w.-]/
%r{^tcp://}
/\A\h{64}\z/
/\A\h{40}\z/
/[^-a-z0-9]/
/(\[[xX]\])/
/\A\w{8,}\z/
%r{\A.*?://}
/\A\h{40}\z/
/\AGitlab::/
```

The complexity here comes from using different anchors (`\A`, `\z`, `^`, and `$`).
The VM should support a simple regular expression type and various operations for compiling regular expressions, finding matches, etc. The compiler / runtime in turn should use this. There are two ways this can be exposed to the language; one is that regex literals (e.g. `/foo/`) are used, and preferably the regex (at least for literals) is compiled when parsing the bytecode.
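A rough sketch of what such VM-level operations could look like; the instruction names and register layout here are assumptions, not Inko's actual bytecode:

```rust
// Hypothetical register-based instructions; `u16` register indices and the
// operation names are invented for illustration only.
enum Instruction {
    // Compile the pattern string in `source` and store the regex in `target`.
    RegexCompile { target: u16, source: u16 },
    // Store in `target` whether the regex in `pattern` matches `subject`.
    RegexIsMatch { target: u16, pattern: u16, subject: u16 },
    // Store the capture groups of the first match (if any) in `target`.
    RegexCaptures { target: u16, pattern: u16, subject: u16 },
}
```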