Regular expressions #161
Regex interning could be done by reusing interned strings and mapping these to their regex objects. This way re-using the same regex literal would result in only one heap allocation. Since all of this happens when parsing bytecode, it won't have any runtime overhead (once all modules are loaded).
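If the engine were Rust's regex crate, an intern pool along these lines could work; a minimal sketch (the `RegexPool` type and the `string_id` parameter are hypothetical, not existing Inko VM code):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use regex::Regex;

// Hypothetical pool mapping interned-string IDs to compiled regexes, so
// repeated uses of the same regex literal share a single compiled object.
struct RegexPool {
    by_source: HashMap<usize, Arc<Regex>>,
}

impl RegexPool {
    fn new() -> Self {
        RegexPool { by_source: HashMap::new() }
    }

    // `string_id` stands in for whatever identifier the VM uses for an
    // interned string; compilation happens at most once per literal.
    fn intern(&mut self, string_id: usize, source: &str) -> Arc<Regex> {
        self.by_source
            .entry(string_id)
            .or_insert_with(|| Arc::new(Regex::new(source).unwrap()))
            .clone()
    }
}
```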
Rust's regex crate would be the most likely underlying engine used for this. I in particular like its support for limiting the size of compiled regular expressions, which would be an interesting feature to expose to the Inko runtime. The VM instructions we'd need (that I can think of) would cover at least compiling a regular expression and finding matches.
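For reference, the regex crate exposes the size limit through `RegexBuilder::size_limit`, which caps the heap size of the compiled program and fails compilation when it would be exceeded:

```rust
use regex::RegexBuilder;

fn main() {
    // Refuse to compile any pattern whose compiled form exceeds ~1 MiB.
    let result = RegexBuilder::new(r"(foo|bar|baz){1,100}")
        .size_limit(1 << 20)
        .build();

    match result {
        Ok(regex) => println!("compiled: {}", regex.as_str()),
        Err(error) => println!("rejected: {}", error),
    }
}
```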
Per rust-lang/regex#362 it seems the
One option might be to ditch supporting regular expressions in the standard library, instead deferring this to a C library used through the FFI API. Alternatively, we could perhaps support Rosie.
Rosie appears to offload most of its work to Lua, with the C library being a wrapper around the Lua embedding API (if I'm reading everything properly). This may result in undesirable overhead when using Rosie from Inko.
Rosie seems like an unlikely candidate, but the underlying idea (reusable and testable PEG grammars) is very interesting. There are currently two ideas I'm toying with:
Syntax grammars

Syntax based grammars would be more verbose compared to regular expressions, but also easier to read. Let's say we want to parse an ISO 8601 date. Using regular expressions, we may end up with something like the following:
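For illustration, such a pattern could look as follows (sketched with Rust's `regex` crate; the exact pattern is an example, not taken from Inko):

```rust
use regex::Regex;

fn main() {
    // An illustrative pattern for a YYYY-MM-DD date, with capture groups
    // for the year, month, and day.
    let date = Regex::new(r"^(\d{4})-(\d{2})-(\d{2})$").unwrap();

    assert!(date.is_match("2018-02-15"));
}
```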
The syntax based grammar would instead define a set of named rules, one per component of the date, combined into a single rule for the date as a whole; see the sketch below.
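To give a rough idea of the shape, here is a sketch using Rust's `peg` crate; the rule names are invented for illustration and this is not a proposal for Inko's syntax. Note that `peg`, like most Rust PEG crates, expands the grammar at compile time, which is the limitation mentioned below:

```rust
peg::parser! {
    grammar iso8601() for str {
        // A date is year "-" month "-" day, e.g. "2018-02-15".
        pub rule date() -> (u16, u8, u8)
            = y:$(['0'..='9']*<4>) "-" m:$(['0'..='9']*<2>) "-" d:$(['0'..='9']*<2>)
            { (y.parse().unwrap(), m.parse().unwrap(), d.parse().unwrap()) }
    }
}

fn main() {
    assert_eq!(iso8601::date("2018-02-15").unwrap(), (2018, 2, 15));
}
```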
A big downside of syntax grammars is the work necessary to implement them. This would require the compiler to parse and verify the syntax, and generate the necessary VM instructions or Inko code to parse the input efficiently. Using an existing Rust crate might allow us to offload all of this to said crate at runtime, but thus far I have been unable to find a well maintained crate for parsing PEG at run time; most only support parsing them at compile time.

Parsing combinators

Parsing combinators are just methods that take some kind of input, and return a parser for that input. To parse the same ISO 8601 date mentioned above, you may write something like the sketch below:
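A minimal hand-rolled sketch in Rust, assuming invented `digits` and `literal` combinators (Inko's actual API would differ):

```rust
// A parser is a function from input to an optional (value, remaining input).
type Parser<'a, T> = Box<dyn Fn(&'a str) -> Option<(T, &'a str)> + 'a>;

// Parses exactly `n` ASCII digits, returning them as a string slice.
fn digits<'a>(n: usize) -> Parser<'a, &'a str> {
    Box::new(move |input| {
        let bytes = input.as_bytes();

        if bytes.len() >= n && bytes[..n].iter().all(|b| b.is_ascii_digit()) {
            Some((&input[..n], &input[n..]))
        } else {
            None
        }
    })
}

// Parses a literal string.
fn literal<'a>(expected: &'a str) -> Parser<'a, &'a str> {
    Box::new(move |input| input.strip_prefix(expected).map(|rest| (expected, rest)))
}

fn main() {
    let year = digits(4);
    let dash = literal("-");
    let month = digits(2);
    let day = digits(2);

    // Chain the parsers by hand; real combinators would hide this plumbing
    // behind methods such as `and_then` or `then`.
    let parsed = year("2018-02-15")
        .and_then(|(y, rest)| dash(rest).map(|(_, rest)| (y, rest)))
        .and_then(|(y, rest)| month(rest).map(|(m, rest)| ((y, m), rest)))
        .and_then(|((y, m), rest)| dash(rest).map(|(_, rest)| ((y, m), rest)))
        .and_then(|((y, m), rest)| day(rest).map(|(d, rest)| ((y, m, d), rest)));

    assert_eq!(parsed, Some((("2018", "02", "15"), "")));
}
```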
While this approach is much more verbose, it has several benefits:
There are also a few downsides to parser combinators:
Also worth adding: regular expressions are great for very simple patterns. PEG grammars and parsing combinators (regardless of the parsing algorithm used) sacrifice compactness for readability and maintainability. This means that simple PEGs/combinators might be more verbose compared to the equivalent regular expressions, but they (in my opinion) tend to be equally readable when used for very complex parsers. In other words: the "startup cost" of PEGs/combinators is a bit higher, but it stays more or less the same. Regular expressions on the other hand start out fairly straightforward, but become more painful to work with very quickly.
There are three questions to think about here:
In all cases the answer comes down to "That depends", but I used GitLab CE to try and get some more data. In CE, there are a total of 556 regular expressions in application code (excluding configuration and tests). I used the following script for this:

```ruby
# frozen_string_literal: true

# gem install parallel
require 'ripper'
require 'parallel'

patterns =
  Parallel
    .map(Dir['./{app,lib}/**/*.rb'], in_processes: 4) do |path|
      found = []
      buffer = ''
      in_regex = false

      # Lex each file and collect the source of every regexp literal.
      Ripper.lex(File.read(path)).each do |token|
        if token[1] == :on_regexp_beg
          in_regex = true
          # Start collecting a new pattern.
          buffer = ''
        end

        if in_regex
          buffer += token[2]
        end

        if token[1] == :on_regexp_end
          in_regex = false
          found << buffer.gsub("\n", '\n')
        end
      end

      found
    end
    .inject([]) { |acc, curr| acc.concat(curr) }

# Bucket the patterns by length: tens below 100 characters, hundreds above.
buckets = Hash.new(0)

patterns.each do |pattern|
  bucket =
    if pattern.length >= 100
      pattern.length.round(-2)
    else
      pattern.length.round(-1)
    end

  buckets[bucket] += 1
end

puts 'Bucket Amount'

buckets.sort.each do |(bucket, value)|
  puts "#{bucket.to_s.ljust(6, ' ')} #{value.to_s.ljust(6, ' ')} #{'▮' * value}"
end
```

Using this on CE, the output is a histogram of the patterns bucketed by length.
Going through the individual regular expressions, the complexity varies greatly. Some are simple, while others are far more involved, such as this one:

````ruby
%r{
  (?<code>
    # Code blocks:
    # ```
    # Anything, including `/cmd arg` which are ignored by this filter
    # ```
    ^```
    .+?
    \n```$
  )
  |
  (?<html>
    # HTML block:
    # <tag>
    # Anything, including `/cmd arg` which are ignored by this filter
    # </tag>
    ^<[^>]+?>\n
    .+?
    \n<\/[^>]+?>$
  )
  |
  (?<html>
    # Quote block:
    # >>>
    # Anything, including `/cmd arg` which are ignored by this filter
    # >>>
    ^>>>
    .+?
    \n>>>$
  )
  |
  (?:
    # Command not in a blockquote, blockcode, or HTML tag:
    # /close
    ^\/
    (?<cmd>#{Regexp.new(Regexp.union(names).source, Regexp::IGNORECASE)})
    (?:
      [ ]
      (?<arg>[^\n]*)
    )?
    (?:\n|$)
  )
}mix
````

For many of the more complex ones it seems like a proper parser would have been much more suitable. For simple regular expressions I suspect a parser might be overkill, even when using parsing combinators. Lua's patterns system might be more suitable, but I think it somewhat suffers from the same issues as regular expressions: it works for small/simple patterns, but likely doesn't scale to more complex ones. To illustrate, here are some simple patterns from the CE codebase:

```
//
/./
/^2/
/\n/
/\s/
/\s/
/-+/
/.+/
/.+/
/.+/
/\t/
/\s/
/../
/=~/
/==/
/!=/
/\s+/
/\d+/
/^\?/
/\s+/
```

Most of these are still reasonably easy to understand. The following patterns however already get more complicated:

```
%r{\/.*\/}
/\A(\d+)-/
/[\w\.-]+/
/\s+/
/\s+/
/^readme/i
/.*\.map$/
/#(\d+)\z/
%r{\Adoc/}
/section_/
/\A[^,]+\z/
/\A[^@]+\z/
%r{^/\w.*$}
/[\w%.+-]+/
/\s/
/[<>+]/
/key-(\d+)/
/[^\s\w.-]/
%r{^tcp://}
/\A\h{64}\z/
/\A\h{40}\z/
/[^-a-z0-9]/
/(\[[xX]\])/
/\A\w{8,}\z/
%r{\A.*?://}
/\A\h{40}\z/
/\AGitlab::/
```

The complexity here comes from using different anchors (`\A`, `\z`, `^`, and `$`).
The VM should support a simple regular expression type and various operations for compiling regular expressions, finding matches, etc. The compiler / runtime in turn should use this. There are two ways this can be exposed to the language; one is that regex literals (e.g. `/foo/`) are used, and preferably the regex (at least for literals) is compiled when parsing the bytecode.
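A rough sketch of what such VM-level operations could look like; the instruction names and register layout here are assumptions, not Inko's actual bytecode:

```rust
// Hypothetical register-based instructions; `u16` register indices and the
// operation names are invented for illustration only.
enum Instruction {
    // Compile the pattern string in `source` and store the regex in `target`.
    RegexCompile { target: u16, source: u16 },
    // Store in `target` whether the regex in `pattern` matches `subject`.
    RegexIsMatch { target: u16, pattern: u16, subject: u16 },
    // Store the capture groups of the first match (if any) in `target`.
    RegexCaptures { target: u16, pattern: u16, subject: u16 },
}
```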