Consider Span<char> overloads on Regex classes #23602
Comments
copied danmosemsft's initial post: Right now System.Text.RegularExpressions.* only works against strings. Matching against buffers requires a string be created from the buffer. That string might not be needed again (especially if there's no match). Motivating example: a grep tool that enumerates files in a directory tree, and scans their content against a pattern. It should not need to create strings except when a line in the buffer matches the pattern. @stephentoub mentioned that while Regex.IsMatch could likely have a Span overload, ReadOnlyMemory may be a better fit for Regex.Match, as it returns a Match object. Also worth considering is a Stream overload. This proposal obviously isn't ready for review. @ViktorHofer could you take on making this into a reviewable proposal? It seems to be a fit for your regex perf mission. |
@JeremyKuhne presumably your regex filter for GetFiles could operate more efficiently if IsMatch could accept a span? |
Yes, I already started thinking about it. It will need a bit more investigation. |
Even if we can do this in phases it would be awesome. Presumably it wouldn't be too hard to change RegexRunner.runtext to ReadOnlySpan and that would open up hitting statics on Regex? Haven't looked too deeply, but the key is not the pattern, but the input. |
+1 |
This would really help Select-String in PowerShell. Tried to parallelize it and ran into GC death on string allocations for the inputs. So +1 |
Just as a summary of the
Maybe also provide
|
cc @JeremyKuhne |
Some discussion in https://github.com/dotnet/corefx/issues/27267 @powercode for your |
Design in progress but will miss 2.1 |
RegexRunner.runtext is exposed (internal protected) and therefore can't be changed easily without breaking already compiled assemblies. |
|
Any progress on this? |
Just my 2 cents but I'm excited for this to be a reality since it seems so logical to expose Isn't the real difference this? string Value => Text.Substring(Index, Length); vs ReadOnlySpan<char> ValueSpan => Text.AsSpan().Slice(Index, Length); I'm definitely not a pro at Spans, but I get the underlying benefit. |
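To make the comparison in the comment above concrete, here is a minimal sketch of the two shapes. The class and method names (SliceDemo, ValueAsString, ValueAsSpan) are hypothetical and are not part of any Regex API; the sketch only assumes that a capture is described by the input text plus an index and a length, as the comment suggests.

```csharp
using System;

// Hypothetical helper, for illustration only; not part of System.Text.RegularExpressions.
public static class SliceDemo
{
    // Allocates a new string that holds a copy of the selected characters.
    public static string ValueAsString(string text, int index, int length) =>
        text.Substring(index, length);

    // Returns a view over the same characters in the original string; nothing is copied.
    public static ReadOnlySpan<char> ValueAsSpan(string text, int index, int length) =>
        text.AsSpan(index, length);
}
```

Both calls describe the same characters. The difference is that the first allocates a string per call, while the second only creates a span pointing into the existing string, which is why a span-returning property looks attractive when there are many captures.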
Right I agree, the |
I have another use case for this proposal. I'm writing a compiler for a simple scripting language. Given how simple the language is, I'm planning to just use a simple regex-based tokenizer (an LL recursive-descent lexer or something would be overkill). Now, I have exactly 25.3 MB worth of text to process. Having to create substrings for every single token means millions of allocations. (Still, this isn't exactly a major problem for me. Performance isn't critical. I can easily allow corert to spend a few seconds allocating/deallocating strings. I ended up here mainly because I "wanted to do the right thing" and was curious to see why the Regex API doesn't accept spans.) |
Are there any plans to look at adding ReadOnlySpan overloads or a different implementation for Regex for .NET 6? |
@michaelherndon there are no plans for .NET 6. If anyone is passionate about this, the next step is to finalize an API proposal. That would probably need a fair bit of thought -- see detailed comments by @terrajobst above. One other thing to consider is that there are 2 regex engines (the mode that compiles at runtime and the interpreter) and any new API would need implementing in all engines. We are considering adding others -- a source generator one, which would be a variation of compiled but at build time using C# instead of ref-emit, and possibly a DFA-based engine which would provide stronger execution time guarantees. Neither of these will be in .NET 6, but the first one is probably funded for post .NET 6, possibly as an out-of-band release. So with 2, 3 or 4 engines, each based on quite different mechanisms, implementing new API would be some work. |
@danmoseley thanks for the update. That definitely sounds like a bit of work. |
@michaelherndon could you say more about your scenario? |
I was debating about trying to rewrite C# code based on the Rails Inflector logic for pluralization and singularization. Inflector uses a fair amount of regular expressions. Humanizer has similar logic. Using ReadOnlySpan for changing code styles and capitalization is pretty straightforward. Transforming something like Inflector into a parser to leverage ReadOnlySpan would be a bit of work, so I was looking to see if Regex had started implementing it, so that I could leave the code as-is and still benefit from fewer allocations/copies of strings. |
Four years have passed and the issue is still open. I hope it's implemented in .NET 6. |
At this point it seems unlikely, given that we still don't have a design. |
Uhg. Why! Why can't this happen?! It's such a logical step. Why can't we just provide a |
See my comment above. It's not that it can't, just that it requires design work. And generally speaking the runway for .NET 6 is closing. |
As the rando internet person that bumped the thread: I get the frustration of a 4-year-old feature request and obviously I do see the benefits of Regex having lower allocations. Conversely, I also get that only so many things can go into the release of a widely used framework that values backwards compatibility. Even for a company like Microsoft, there are still only so many resources and hours. Personally, I'd still take the improvements in System.Security, the new Metrics API, or the elusive Assembly Neutral Interfaces over a new Regex API surface. |
Ok... So I have to admit that I'm a bit frustrated at this one. |
This is probably not the solution you're looking for, but since the source is available, you could do some surgery:
public static class CaptureSurgery
{
    private static readonly PropertyInfo _textProperty = typeof(Capture).GetProperty("Text", BindingFlags.Instance | BindingFlags.NonPublic);

    public static ReadOnlySpan<char> AsSpan(this Capture capture)
    {
        return ((string)_textProperty.GetValue(capture)).AsSpan();
    }
}
Although since the point is improved performance, this is hardly a solution. |
@BenjaBobs If you're worried about performance, create and cache a delegate to the getter. Also, you need to slice the text to get the value of the capture:
public static class CaptureSurgery
{
    private static readonly Func<Capture, string> _textDelegate =
        typeof(Capture).GetProperty("Text", BindingFlags.Instance | BindingFlags.NonPublic)
            .GetGetMethod(nonPublic: true)
            .CreateDelegate<Func<Capture, string>>();

    public static ReadOnlySpan<char> AsSpan(this Capture capture) =>
        _textDelegate.Invoke(capture).AsSpan(capture.Index, capture.Length);
} |
@BenjaBobs @svick I didn't know that was possible! That's awesome to know. Thank you! |
And just a note: |
@svick What version of .NET has the generic call for |
@electricessence As of right now only .NET, i.e. 5 and 6+. |
@silkfire Got it to work in .NET Standard without the generic methods. I wonder if there's a boxing cost. |
edit by @ViktorHofer, moved initial post down.
Spanifying Regex (a) removes unnecessary string allocations that tend to decrease perf and (b) allows different types of Memory to be processed.
API proposal and implementation: ViktorHofer/corefx#1
Proposed APIs
This diff contains the Memory overloads and the MatchEvaluator overloads. See discussion above if we should introduce new ref types for Match, Group & Capture.
Discussion points
Ref struct for Match and siblings (Capture & Group).
I had a discussion with Jan offline and he pointed out that we might want to introduce a ref struct MatchValue type that is returned by APIs that take Span/Memory as an input. The issue with that is that we currently have the following hierarchy: Match --> Group --> Capture, and that Groups and Match contain collections of Captures/Groups.
startat overload
Should we add these startat convenience overloads for Span also?
If yes, this commit should be reverted ViktorHofer/corefx@bf7d7f9
RegexSplitEnumerator RTL yield order
If you call the Span version of Regex.Split and pass RegexOptions.RightToLeft to it, the yield order of the enumerator will also be right to left, as we start looking for matches from right to left. The current implementation (which is not an enumerator!) reverses the captured strings before returning.
RegexSplitEnumerator GetEnumerator (see ref diff)
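The ordering question can be seen with the existing string-based APIs. The following is only a sketch: RtlOrderDemo and its methods are hypothetical names, and the proposed RegexSplitEnumerator is not used here. It compares the order in which matches are found under RegexOptions.RightToLeft with the order in which the string-based Regex.Split returns its pieces.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RtlOrderDemo
{
    // Matches are searched for from the end of the input, so the rightmost match comes first.
    public static string[] MatchOrder(string input, string pattern) =>
        Regex.Matches(input, pattern, RegexOptions.RightToLeft)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    // The string-based Split collects its pieces right to left, then reverses them before returning.
    public static string[] SplitOrder(string input, string pattern) =>
        Regex.Split(input, pattern, RegexOptions.RightToLeft);
}
```

For the input "a,b,c", MatchOrder with the pattern "[a-z]" gives c, b, a, while SplitOrder with the pattern "," gives a, b, c. A lazy enumerator cannot reverse without buffering all results first, which is presumably why the Span version would yield right to left.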