Compiling string matching algorithms and regular expressions to java bytecode

hyperpape/needle

Needle

This library compiles regular expressions to Deterministic Finite Automata (DFAs), so matching is non-backtracking, and then compiles those automata to JVM bytecode. Each regex becomes a separate JVM class.
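To make the idea of a non-backtracking DFA concrete, here is a minimal hand-written sketch of an automaton for the pattern `http://.+`. It is an illustration only: the class name `UrlDfaSketch` and its state numbering are assumptions made for this example, and the bytecode that the library actually generates may look quite different.

```java
/** Illustrative, hand-written DFA for the pattern http://.+ (not the library's generated code). */
class UrlDfaSketch {
    private static final String PREFIX = "http://";
    // States 0..6: number of prefix characters consumed so far.
    // State 7: the whole prefix has been consumed, at least one more character is required.
    private static final int ACCEPT = 8; // prefix plus one or more characters consumed
    private static final int DEAD = -1;  // no match is possible any more

    /** Returns true if the whole input matches http://.+, reading each character exactly once. */
    static boolean matches(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (state < PREFIX.length()) {
                // Still inside the literal prefix: advance only on the expected character.
                state = (c == PREFIX.charAt(state)) ? state + 1 : DEAD;
            } else {
                // After the prefix, '.' accepts anything except a line terminator.
                state = isLineTerminator(c) ? DEAD : ACCEPT;
            }
            if (state == DEAD) {
                return false;
            }
        }
        return state == ACCEPT;
    }

    // Line terminators that '.' excludes by default in java.util.regex.
    private static boolean isLineTerminator(char c) {
        return c == '\n' || c == '\r' || c == '\u0085' || c == '\u2028' || c == '\u2029';
    }

    public static void main(String[] args) {
        System.out.println(matches("http://www.google.com"));
    }
}
```

Each input character causes exactly one state transition, so the running time is linear in the input length regardless of the pattern.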

Status

This project has no users as of yet, but should be usable. It would probably be advisable to precompile a static set of regexes, test them for your use cases, and verify that they perform well.

Usage

The library generates a pair of classes for each regex: a Pattern (com.justinblank.strings.Pattern, not java.util.regex.Pattern), and a Matcher (com.justinblank.strings.Matcher, not java.util.regex.Matcher).

These classes can either be created at runtime, which requires the needle-compiler jar, or saved to a class file that is included on an application's classpath alongside the needle-types library.

Runtime Creation

Each call to DFACompiler.compile will create a new class.

static final Pattern URL_PATTERN = DFACompiler.compile("http://.+", "OverSimplifiedURLMatcher");
....
Matcher matcher = URL_PATTERN.matcher("http://www.google.com");
assertTrue(matcher.matches());
assertTrue(matcher.containedIn());
MatchResult matchResult = matcher.find();
assertTrue(matchResult.matched);
assertEquals(0, matchResult.start);
assertEquals(21, matchResult.end);

Precompilation

At build time, we can create a classfile and write it to the filesystem:

Precompile.precompile("http://.+", "OversimplifiedURLMatcher", somedirectory.getAbsolutePath());

At run-time, we can construct our class and build a matcher:

Pattern pattern = new OversimplifiedURLMatcher();
Matcher matcher = pattern.matcher("http://www.google.com");
assertTrue(matcher.matches());

See Pattern for the supported operations.

Compatibility and Syntax

This library attempts to match the standard library syntax for all supported operations. For any regex, the results of using this library should be the same as those of the standard library (java.util.regex), or a RegexSyntaxException should be raised during parsing. Any other discrepancy should be reported as a bug.
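One way to check this compatibility claim for a particular regex is to record what the standard library does and compare it with the values this library returns. The sketch below covers only the java.util.regex side; the class `StdlibReference` and its method names are hypothetical helpers written for this example and are not part of this library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that records java.util.regex results to use as a reference. */
class StdlibReference {

    /** Whether the entire input matches the regex according to java.util.regex. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** Start (inclusive) and end (exclusive) of the first match, or null if there is none. */
    static int[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        int[] span = firstMatch("http://.+", "http://www.google.com");
        System.out.println(span[0] + ".." + span[1]);
    }
}
```

For the URL example in the Usage section, the reference span is 0 to 21, which agrees with the start and end values asserted there.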

Capturing groups are not currently supported. Backreferences are unlikely to ever be supported.

The following character classes and escape sequences are supported:

\a, \d, \D, \e, \f, \h, \s, \S, \t, \w, \W, \x, \0, \., \\

Unicode

The library supports searching against any string; however, the needles that we search for are currently limited to the Basic Multilingual Plane. Regexes containing non-ASCII characters are currently likely to be much slower than ASCII regexes (see Issue #16). Testing of non-ASCII regexes and non-ASCII matches is currently less comprehensive.
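Given these limits, it may be worth checking a needle before compiling it. The sketch below uses only the standard library; the class `NeedleChecks` and its methods are hypothetical names for this example, and how this library itself reacts to a needle outside the Basic Multilingual Plane is not described here.

```java
/** Hypothetical pre-checks for a regex string, based on the Unicode limits described above. */
class NeedleChecks {

    /** True if every code point in the string lies in the Basic Multilingual Plane. */
    static boolean isBmpOnly(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    /** True if every character is ASCII, the case described above as the faster one. */
    static boolean isAsciiOnly(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        System.out.println(isBmpOnly("http://.+") + " " + isAsciiOnly("http://.+"));
    }
}
```

A string containing a supplementary character, such as an emoji encoded as a surrogate pair, fails `isBmpOnly`, while an accented Latin letter passes `isBmpOnly` but fails `isAsciiOnly`.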

Java versions

The compiler requires Java 11. Generated classes should work with Java 8.
