std.regex: major internal redesign, also fixes issue 13532 #5722

Merged
merged 15 commits into from
Oct 16, 2017

Conversation

DmitryOlshansky
Member

@DmitryOlshansky DmitryOlshansky commented Sep 5, 2017

Finally I pulled this one off:

  • immutable Regex!Char works
  • StaticRegex is just a Regex alias
  • enum vs static ctRegex doesn't matter
  • template bloat is cut by a decent percent: the match***/replace*** functions are no longer templated by engine type, nor is StaticRegex a distinct type
  • finally, a single point for managing the intrusive ref-counting of Engines

More importantly this opens up future enhancements:

  • regex objects now contain a factory that produces an optimal matcher for the pattern, so the engine choice can now be completely adaptive (each ctRegex has a unique instance of such a factory). Now I can special-case the heck out of common patterns w/o degrading the design of the library a bit.
  • less of good ol' messy code, which eventually should bring more contributions to std.regex
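
A minimal sketch of the factory idea described in the list above. All names here (`Matcher`, `MatcherFactory`, `Pattern`, `compile`, the two engine classes) are hypothetical and chosen for illustration; this is not the actual std.regex code, and the choice of a Thompson-style engine as the non-backtracking alternative is an assumption.

```d
import std.stdio : writeln;

// Hypothetical names throughout; not the std.regex implementation.
interface Matcher
{
    string engineName();
}

final class BacktrackingEngine : Matcher
{
    string engineName() { return "backtracking"; }
}

final class ThompsonEngine : Matcher
{
    string engineName() { return "thompson"; }
}

// The factory is the single place that knows which engine to build.
interface MatcherFactory
{
    Matcher create();
}

final class BacktrackingFactory : MatcherFactory
{
    Matcher create() { return new BacktrackingEngine; }
}

final class ThompsonFactory : MatcherFactory
{
    Matcher create() { return new ThompsonEngine; }
}

// A compiled pattern carries its factory, so callers never name an engine type.
struct Pattern
{
    string source;
    MatcherFactory factory;
}

Pattern compile(string source, bool hasBackrefs)
{
    MatcherFactory f;
    // Backreferences force backtracking, as in the check quoted later in this review.
    if (hasBackrefs)
        f = new BacktrackingFactory;
    else
        f = new ThompsonFactory;
    return Pattern(source, f);
}

void main()
{
    writeln(compile(`(a)\1`, true).factory.create().engineName());
    writeln(compile(`[0-9]+`, false).factory.create().engineName());
}
```

Because the matching functions only see `Pattern` and `Matcher`, they need not be templated on the engine type, which is where the reduction in template bloat would come from in a design like this.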

@dlang-bot
Contributor

Thanks for your pull request, @DmitryOlshansky!

Bugzilla references

Auto-close Bugzilla Description
13532 std.regex performance (enums; regex vs ctRegex)

@DmitryOlshansky DmitryOlshansky force-pushed the regex-matcher-interfaces branch 2 times, most recently from b3ca451 to dc32bbe Compare September 5, 2017 10:48
@DmitryOlshansky DmitryOlshansky force-pushed the regex-matcher-interfaces branch 9 times, most recently from 671a5f7 to b6bfd87 Compare September 6, 2017 09:46
@DmitryOlshansky
Member Author

@wilzbach
89.774% (-0.001%) compared to 7c82e60

Is hilarious. Can we at least set some kind of tolerance, say about 0.01%?

@wilzbach
Member

wilzbach commented Sep 8, 2017

Is hilarious. Can we at least set some kind of tolerance, say about 0.01%?

Sadly AFAIK CodeCov doesn't support this. I was working on querying the API from dlang bot and sending a similar CI status which is in our control.
However, sadly I ran out of time.

@DmitryOlshansky
Member Author

DmitryOlshansky commented Sep 25, 2017

Wo-ho. Jenkins test succeeded. With auto-tester tackled, I think I'm done here.
Ping @wilzbach

UPDATE: still has bugs on Win32.. working on it.

Member

@wilzbach wilzbach left a comment

I am not very familiar with std.regex, but most changes are rather trivial.
On a first pass, I only found a couple of nits, but it would be great if someone experienced with std.regex could keep an eye on this big change overall as well.

@@ -716,7 +718,7 @@ template BacktrackingMatcher(bool CTregex)
debug(std_regex_matcher) writeln("pop array SP= ", lastState);
}

-static if (!CTregex)
+static if (true)
Member

What prevents you from removing the static if entirely?

Member Author

Ehm. Will revisit this

Member

I think that's the only thing missing. If you don't feel like trying to push this we could, of course, also remove this afterwards in a separate PR

{
_refCount = 1;
re = program;
Member

How about moving these two lines to initialize? They are repeated in all constructors ...

Member Author

@DmitryOlshansky DmitryOlshansky Sep 27, 2017

re is const, so it can't be assigned outside of a constructor. A very unpleasant limitation.
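
A minimal sketch of the limitation mentioned above. The `Program` and `Engine` types and the `stackSize` field are hypothetical stand-ins, not the types from the PR: a `const` field can be initialized in a constructor, but not from a helper method that the constructors share.

```d
struct Program { int id; }   // hypothetical stand-in for the compiled regex

class Engine
{
    const Program re;        // const field: can only be set in a constructor
    size_t refCount;
    size_t stackSize;

    this(Program program)
    {
        refCount = 1;
        re = program;        // allowed: initialization inside the constructor
        initialize();
    }

    // Setup that touches only mutable fields can be shared between constructors.
    private void initialize()
    {
        // re = Program(0);  // would not compile: `re` is const here
        stackSize = 64;
    }
}

void main()
{
    auto e = new Engine(Program(7));
    assert(e.re.id == 7 && e.refCount == 1 && e.stackSize == 64);
}
```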

charsets = re.charsets;
foreach (ref set; re.charsets)
{
charsets ~= set.intervals;
Member

@UplinkCoder will be happy about all this allocation at CTFE ;-)

Member Author

Well, it's the number of character classes in the regex pattern. I hope it's in range of 10s-100s.

@stefan-koch-sociomantic

what about

auto oldLength = charsets.length;
charsets.length += re.charsets.length;
charsets[oldLength .. $] = re.charsets[];

Member Author

@stefan-koch-sociomantic Not bad. Honestly, this piece of code isn't anywhere near the top of the performance-sensitive functions. Trie construction is.
See code around this line:
https://github.com/dlang/phobos/blob/master/std/uni.d#L3871
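
For reference, the bulk-append pattern suggested above can be sketched on plain `int` arrays. The function names are hypothetical, and the sketch assumes source and destination share an element type, which may not hold for the charsets code under review.

```d
// Appends element by element; each `~=` may trigger a reallocation.
int[] appendOneByOne(int[] dst, const int[] src)
{
    foreach (x; src)
        dst ~= x;
    return dst;
}

// Grows the array once, then copies the new tail with a slice assignment.
int[] appendInBulk(int[] dst, const int[] src)
{
    immutable oldLength = dst.length;
    dst.length += src.length;
    dst[oldLength .. $] = src[];
    return dst;
}

void main()
{
    assert(appendOneByOne([1, 2], [3, 4]) == appendInBulk([1, 2], [3, 4]));
}
```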

import std.conv : to;
enum re1 = ctRegex!`[0-9][0-9]`;
immutable static re2 = ctRegex!`[0-9][0-9]`;
immutable iterations = 1000_000;
Member

Nit: 1_000_000

assert(result1 == result2);
auto ratio = 1.0 * enumTime.total!"usecs" / staticTime.total!"usecs";
// enum is faster or the diff is less < 30%
assert(ratio < 1.0 || abs(ratio - 1.0) < 0.3,
Member

Seeing that you experimented quite a bit with this, this still seems a bit fragile.

Member Author

And it is. It seems I can't provide even basic performance guarantees in the unittest.
This is a problem

return ++m.refCount;
}

override size_t decRef(Matcher!Char m) const @trusted
Member

Why not use postblit this(this) and destructor ~this() instead of rolling your own mechanism?

Member Author

@DmitryOlshansky DmitryOlshansky Sep 27, 2017

Polymorphism! Structs are not polymorphic; this class is then wrapped in a proper struct that does the inc/dec.
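
A minimal sketch of that arrangement, with hypothetical names (`Engine`, `Backtracking`, `EngineRef`) rather than the PR's actual types: the class hierarchy provides the polymorphism, and a thin struct wrapper uses a postblit and destructor to do the counting.

```d
// Hypothetical sketch: names are illustrative, not the std.regex internals.
abstract class Engine
{
    size_t refCount = 1;
    abstract string name();
}

final class Backtracking : Engine
{
    override string name() { return "backtracking"; }
}

// The struct owns the counting; the class provides the polymorphism.
struct EngineRef
{
    private Engine impl;

    this(Engine e) { impl = e; }

    this(this)                       // postblit: a copy shares the engine
    {
        if (impl) ++impl.refCount;
    }

    ~this()
    {
        if (impl && --impl.refCount == 0)
            impl = null;             // a real implementation would free here
    }

    string name() { return impl.name(); }
    size_t count() { return impl.refCount; }
}

void main()
{
    auto a = EngineRef(new Backtracking);
    {
        auto b = a;                  // postblit runs, count becomes 2
        assert(a.count == 2);
    }                                // b's destructor runs, count back to 1
    assert(a.count == 1);
}
```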

// This only maintains internal ref-count,
// deallocation happens inside MatcherFactory
@property ref size_t refCount() @safe;
// Copy internal state to another engine, using memory arena 'memory'
Member

Arena? Did you mean "zone" or "region"?
https://en.wikipedia.org/wiki/Region-based_memory_management

Member Author

I really do not care.. a chunk of memory it is

@wilzbach area or region it's all the same.

Member

Yes, but "arena" sounded strange. Note that I don't care about it; it just struck me while reading the diff.

// check if we have backreferences, if so - use backtracking
if (__ctfe) factory = null; // allows us to use the awful enum re = regex(...);
else
if (re.backrefed.canFind!"a != 0")
Member

Nit: Common Phobos style is to have else if on the same line.

{
auto r = cast() this;
r.factory = factory;
return r;
Member

While this is a nice pattern, you completely work around const here.
At least it's internal ;-)

Member Author

Yes. I feel like a compiler can prove this is legal iff all members of the struct are value types or const references.
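
The pattern under discussion can be sketched on a small struct. `Settings` and its fields are hypothetical; only the `cast() this` idiom and the `withFlags` name come from the diff.

```d
struct Settings
{
    int flags;
    string label;

    // Returns a mutable copy with one field replaced; `this` is left untouched.
    Settings withFlags(int newFlags) const
    {
        auto copy = cast() this;   // cast() strips const from the copied value
        copy.flags = newFlags;
        return copy;
    }
}

void main()
{
    const s = Settings(1, "a");
    auto t = s.withFlags(3);
    assert(s.flags == 1 && t.flags == 3 && t.label == "a");
}
```

For a struct like this one, with only value-type and immutable-data members, the copy cannot alias mutable state of the const original, which is the condition the comment above describes.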

@@ -1530,15 +1468,15 @@ private:
@trusted this(Range input, RegEx separator)
{//@@@BUG@@@ generated opAssign of RegexMatch is not @trusted
_input = input;
-separator.flags |= RegexOption.global;
+auto re = separator.withFlags(separator.flags | RegexOption.global);
Member

s/auto/const

Disable "benchmark" in unittest, it's too volatile
with different compiler flags
Also use GC.addRange/GC.removeRange
@wilzbach
Member

wilzbach commented Oct 6, 2017

Running DScanner
../dscanner-285ef162f024cbd305d587e9e0fcfb2292ea93ce/dsc --config .dscanner.ini --styleCheck etc std -I.
make: *** [dscanner] Segmentation fault (core dumped)
@wilzbach My code is so ugly?

Nay, sadly this random segfault has been appearing for the last two weeks. Neither I nor anyone else has had time to dive into it. However, it only happens on every third run ...

@DmitryOlshansky
Member Author

Alright, I think I addressed all of the comments.

@CyberShadow @wilzbach @ZombineDev
What do you think?

@andralex
Member

So why does this fail jenkins and codecov?

@stefan-koch-sociomantic

I'd say because it sees comments as uncovered code?

@DmitryOlshansky
Member Author

DmitryOlshansky commented Oct 11, 2017

So why does this fail jenkins and codecov?

Jenkins passes just fine. Codecov is a mess and doesn't work with std.regex package structure.

@wilzbach
Member

Codecov is a mess and doesn't work with std.regex package structure.

Jup, sadly that's true, but it's only intended as a tool to help reviewers anyhow.

Member

@wilzbach wilzbach left a comment

I think we should let @DmitryOlshansky move forward with this. The only thing I saw was the static if (true), but that one can be addressed later as well...

@DmitryOlshansky
Member Author

Jup, sadly that's true, but it's only intended as a tool to help reviewers anyhow.

Should somehow indicate that it's optional. The common wisdom is red cross - no merge.

tmp.initExternalMemory(memory);
return tmp;
auto backtracking = cast(BacktrackingMatcher) m;
backtracking.s = s;
Member

with (backtracking) comes to mind

Member Author

@DmitryOlshansky DmitryOlshansky Oct 11, 2017

but then

with(backtracking) {
      s = s;
}

???

Member

yah one name would need to be changed :)
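
A sketch of what the renamed version might look like, using a hypothetical `Backtracker` class and `rearm` function. The parameter is called `input` instead of `s`, so no name inside the `with` body is ambiguous. As far as I know, the D specification disallows `with` object symbols that shadow local symbols, so the `s = s` form would be rejected rather than silently self-assign.

```d
// Hypothetical stand-in for the matcher being re-initialized.
class Backtracker
{
    string s;
    size_t index;
}

Backtracker rearm(Backtracker m, string input)
{
    // Inside `with (m)`, `s` and `index` resolve to members of `m`;
    // `input` is not a member, so it resolves to the parameter.
    with (m)
    {
        s = input;
        index = 0;
    }
    return m;
}

void main()
{
    auto m = rearm(new Backtracker, "abc");
    assert(m.s == "abc" && m.index == 0);
}
```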

@DmitryOlshansky
Member Author

@wilzbach
This, again...

../dscanner-285ef162f024cbd305d587e9e0fcfb2292ea93ce/dsc --config .dscanner.ini --styleCheck etc std -I.
std/regex/package.d(432:49)[warn]: Parameter matcher is never used.
make: *** [dscanner] Error 1

The code looks like this:

@trusted bool func(BacktrackingMatcher!Char matcher)
{
    debug(std_regex_ctr) pragma(msg, source);
    mixin(source);
}

I'm f**king tired of optimistic checks landing into our CI infrastructure.
If you can't prove something - don't make that check mandatory!
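
A reduced illustration of why a purely syntactic "unused parameter" check can misfire on code like this. The `source` string and the `string` parameter type here are hypothetical; in the PR the mixed-in source is generated code and the parameter is a matcher object.

```d
// Hypothetical generated source, written as a D token string.
enum source = q{ return matcher.length > 0; };

bool func(string matcher)
{
    // `matcher` is only referenced inside the mixed-in string, so a linter
    // that does not expand mixins cannot see that the parameter is used.
    mixin(source);
}

void main()
{
    assert(func("abc"));
    assert(!func(""));
}
```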

@DmitryOlshansky DmitryOlshansky force-pushed the regex-matcher-interfaces branch 2 times, most recently from 0f86e7d to a292141 Compare October 16, 2017 08:02
@MartinNowak
Member

Great stuff, looks like it regressed our funky dlang-bot regex-split-joiner again.
18135 – [REG2.078] can't join RegexMatch anymore
Trying to reduce the test case.

@MartinNowak
Member

There is also a reported performance regression @DmitryOlshansky.
18114 – [Reg 2.078] regex performance regression
