Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve routing RegEx constraint dependency size issue #46142

Closed
eerhardt opened this issue Jan 18, 2023 · 57 comments · Fixed by #46793 or #46813
Closed

Resolve routing RegEx constraint dependency size issue #46142

eerhardt opened this issue Jan 18, 2023 · 57 comments · Fixed by #46793 or #46813
Assignees
Labels
area-web-frameworks *DEPRECATED* This label is deprecated in favor of the area-mvc and area-minimal labels feature-routing
Milestone

Comments

@eerhardt
Copy link
Member

eerhardt commented Jan 18, 2023

Routing has a feature named "Route Constraints". One of the options to make a constraint is to add a regular expression to the route, for example: app.MapGet("/posts/{id:regex(^[a-z0-9]+$)}", …). Because these route constraints are inline in the route string, the Regex code needs to always be in the application, in case any of the routes happen to use a regex constraint.

In .NET 7, we added a new feature to Regex: NonBacktracking. This added a considerable amount of code. Depending on the Regex constructor overload used (the one that takes RegexOptions, which ASP.NET Routing uses), this new feature's code will be left in the app, even if the NonBacktracking engine isn't being used.

ASP.NET Routing uses the CultureInvariant and IgnoreCase options when constructing Regex route constraints.

Testing locally, being able to remove the NonBacktracking engine can cut about .8 MB of the 1.0 MB of Regex code out of the app size.

UPDATE 11/30/2022

With the latest NativeAOT compiler changes, here are updated numbers for linux-x64 NativeAOT:

Hello World 3.22 MB (3,381,112 bytes)
new Regex("").IsMatch 3.50 MB (3,680,200 bytes)
new Regex("", RegexOptions).IsMatch 4.33 MB (4,545,960 bytes)

UPDATE 1/18/2023

With the latest NativeAOT compiler changes, here are updated numbers for linux-x64 NativeAOT:

Hello World 2.79 MB (2,925,616 bytes)
new Regex("").IsMatch 3.04 MB (3,193,344 bytes)
new Regex("", RegexOptions).IsMatch 3.82 MB (4,008,288 bytes)

Options

  1. Add new Regex construction APIs that allow for some RegexOptions to be used, but also allows for the NonBacktracking engine to be trimmed. For example: Regex.CreateCompiled(pattern, RegexOptions). This API would throw an exception if RegexOptions.NonBacktracking was passed.
  2. Remove the use of RegexOptions. The IgnoreCase option can be specified as part of the pattern as a Pattern Modifier: (?i). However, CultureInvariant cannot be specified this way.
    • One option is to drop support for CultureInvariant from regex route constraints. This affects the Turkish 'i' handling.
    • Another option could be to add a CultureInvariant Pattern Modifier to .NET Regex, so this could be specified without using RegexOptions.
  3. When using the new "slim" hosting API (See Allow minimal host to be created without default HostBuilder behavior #32485), we could remove the Regex route constraints feature by default and have an API that adds it back. Apps using the inline regex route constraints and the slim hosting API, would call this new API to opt-in to the feature.
  4. Add a feature to the linker/NativeAOT compiler that can see which RegexOptions values are used in the app. And for the enum values that aren't used, it can trim the code branches behind those values (i.e. the NonBacktracking code). This would accrue more size savings elsewhere as well.
@eerhardt
Copy link
Member Author

FYI - @agocke @MichalStrehovsky @vitek-karas -

This is another case that dotnet/linker#1868 would "just handle". We never use the NonBacktracking option in our Regex's, but the NonBacktracking code can't be trimmed.

Our plan for .NET 8 is to do something in ASP.NET to get the Regex code trimmed/smaller.

@mitchdenny
Copy link
Member

  1. Add new Regex construction APIs that allow for some RegexOptions to be used, but also allows for the NonBacktracking engine to be trimmed. For example: Regex.CreateCompiled(pattern, RegexOptions). This API would throw an exception if RegexOptions.NonBacktracking was passed.

@eerhardt Is the trimmer smart enough to know that if a particular option is specified that it throws an exception and so it doesn't have to evaluate code that might be activated by that code? Seems very smart if it can!

@MichalStrehovsky
Copy link
Member

This would not be just about figuring out constants at each callsite, but also being able to constant propagate the bit checks. Things get very complicated very quickly. I don't think this can fit in 8.0.

Also if we do it, there will also be an endless stream of follow up requests for similar constant pattern issues. E.g. just yesterday I had to work around a similar API that accepts a nullable - if the nullable is null, an expensive thing needs to happen to compute it. If it's not null, the call is cheap: dotnet/runtime#80677

We need to really start paying attention to these patterns in API reviews. Optional parameters, flags that do wildly different things... those should ideally be separate methods if we want good trimming.

@mitchdenny
Copy link
Member

@MichalStrehovsky - so that is primarily in relation to option #4 above right? if we can make changes to the regex library such that the existing trimming behavior can trim that excess that we don't need in ASP.NET it would be a viable path forward?

@MichalStrehovsky
Copy link
Member

@MichalStrehovsky - so that is primarily in relation to option #4 above right? if we can make changes to the regex library such that the existing trimming behavior can trim that excess that we don't need in ASP.NET it would be a viable path forward?

Yep. I think option 4 would be difficult to fit into .NET 8. Trimming is a single pass operation, architecturally. This feature requires knowledge of all callsites before we can do trimming (i.e. it adds an entire new pass).

@mitchdenny
Copy link
Member

@eerhardt what do you think about getting this constructor marked public so we can call it:

https://source.dot.net/#System.Text.RegularExpressions/System/Text/RegularExpressions/Regex.cs,b38ad9d9698a3bc5

In fact, the comment in the constructor is telling:

        internal Regex(string pattern, CultureInfo? culture)
        {
            // Validate arguments.
            ValidatePattern(pattern);
 
            // Parse and store the argument information.
            RegexTree tree = Init(pattern, RegexOptions.None, s_defaultMatchTimeout, ref culture);
 
            // Create the interpreter factory.
            factory = new RegexInterpreterFactory(tree);
 
            // NOTE: This overload _does not_ delegate to the one that takes options, in order
            // to avoid unnecessarily rooting the support for RegexOptions.NonBacktracking/Compiler
            // if no options are ever used.
        }

We could change our callsite to flip on ignore case via the Regex itself and pass in CultureInfo.Invariant.

@mitchdenny
Copy link
Member

Looping in @stephentoub since he added this comment so presumably he was trying to avoid it for AOT scenarios as well (or at least to enable it to be trimmmed).

@stephentoub
Copy link
Member

Instead of:

var r = new Regex("...", RegexOptions.IgnoreCase | RegexOptions.InvariantCulture);

could ASP.NET just do:

Regex r;
CultureInfo old = CultureInfo.CurrentCulture;
CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
try
{
    r = new Regex("(?i)(?:...)");
}
finally
{
    CultureInfo.CurrentCulture = old;
}

?

@mitchdenny
Copy link
Member

OK, so you are relying on this mechanism here to make sure we pick up the invariant culture:

        internal static CultureInfo GetTargetCulture(RegexOptions options) =>
            (options & RegexOptions.CultureInvariant) != 0 ? CultureInfo.InvariantCulture : CultureInfo.CurrentCulture;

@stephentoub
Copy link
Member

Yes

@danmoseley
Copy link
Member

would we recommend that to customers too? (who wanted smallest possible size) or do we imagine a different approach long term to drop nonbacktracking?

@mitchdenny
Copy link
Member

It has a bit of a hacky feel to it as a work around. It is probably fine to get us past this for ASP.NET but I'd love a first class option on the Regex class.

@davidfowl
Copy link
Member

I rather not set async locals to set the current culture, that's a bit hacky.

@mitchdenny
Copy link
Member

So what are the other options here? Modifying the Regex constructure to include something like this?

public Regex(string pattern, bool ignoreCase, CultureInfo? culture ) { }

... and a bunch of other variants?

@javiercn
Copy link
Member

3. We could remove the Regex route constraints feature either with an API that adds it back, or a feature switch set in the .csproj

I do not think removing the feature is an acceptable solution here. It's been there since the times of ASP.NET and is a well-known and used feature, removing it by default for AoT does not seem like the right trade-off.

  1. Add new Regex construction APIs that allow for some RegexOptions to be used, but also allows for the NonBacktracking engine to be trimmed. For example: Regex.CreateCompiled(pattern, RegexOptions). This API would throw an exception if RegexOptions.NonBacktracking was passed.

This seems the more reasonable solution to me, or alternatively a compile-time switch to disable it at the runtime level and trim the associated code.

It might also be interesting to explore the possibility of using the regex source generator to codegen these regexes, although I understand that it might have some drawbacks, like not being able to express regex constraints at runtime if there was no way to discover them at compile time.

@javiercn javiercn added area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates feature-routing labels Jan 18, 2023
@javiercn javiercn added this to the .NET 8 Planning milestone Jan 18, 2023
@ghost
Copy link

ghost commented Jan 18, 2023

Thanks for contacting us.

We're moving this issue to the .NET 8 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s).
If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues.
To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

@stephentoub
Copy link
Member

stephentoub commented Jan 18, 2023

would we recommend that to customers too? (who wanted smallest possible size) or do we imagine a different approach long term to drop nonbacktracking?

They should be using the source generator. So should ASP.NET for the cited scenario, except that currently source generators can't depend on other source generators.

I rather not set async locals to set the current culture, that's a bit hacky.

The fact the CultureInfo uses an async local under the covers is an implementation detail, and this won't affect async locals flow at all, since the culture change is immediately undone, with no async operations performed in the interim. As for it being hacky, it's using the exact mechanism Regex has had for this for over a decade; Regex looks for the current culture, and if you want to change the culture it uses, you need to set the current culture. Setting invariant is so common it also makes it easy via a flag, but the other mechanism still exists, and we can use it here, today, to achieve the stated goals.

So what are the other options here? Modifying the Regex constructure to include something like this?

The other options are the ones Eric enumerates. I don't see us adding the cited ctor "and a bunch of other variants"... as of today there are 10 RegexOptions, and I have no intention of adding 2^10 ctors. If we eventually want to add new APIs related to this, it would likely be CreateInterpreted, CreateCompiled, and CreateNonBacktracking. That, however, is not without its disadvantages/warts, e.g. today NonBacktracking doesn't make use of Compiled, but it could in the future.

NonBacktracking

As an aside, this issue is about being able to trim out NonBacktracking but the cited use is for a regex that is very exposed to untrusted input, and I expect there are a fair number of uses that are susceptible to ReDoS attacks without the dev realizing it, i.e. without them fully understanding the implications of the pattern that will be evaluated against every query string. This could actually be a situation where NonBacktracking is exactly what's desired, for security.

I think option 4 would be difficult to fit into .NET 8

I realize it's challenging. But I opened that issue almost two years ago due to seeing real opportunity for savings, both with the patterns I share as examples, and the variants you cite. We're going to continue to find scenarios it helps, like this one, and until it's addressed, we're going to continue struggling to either find workarounds for those trimming issues or leave size on the table.

I believe the solution today for routing is the setting of CultureInfo, as I outlined. If there's a technical reason why those few lines don't solve the problem today, I'd like to understand better why.

@davidfowl
Copy link
Member

davidfowl commented Jan 18, 2023

They should be using the source generator. So should ASP.NET for the cited scenario, except that currently source generators can't depend on other source generators.

This also only works for constants, so it's a solution but not one for all cases.

The fact the CultureInfo uses an async local under the covers is an implementation detail, and this won't affect async locals flow at all, since the culture change is immediately undone, with no async operations performed in the interim. As for it being hacky, it's using the exact mechanism Regex has had for this for over a decade; Regex looks for the current culture, and if you want to change the culture it uses, you need to set the current culture. Setting invariant is so common it also makes it easy via a flag, but the other mechanism still exists, and we can use it here, today, to achieve the stated goals.

It's hacky and the only reason it works is because it's an async local otherwise it wouldn't just be hacky, it would be buggy (being static mutable state).

I believe the solution today for routing is the setting of CultureInfo, as I outlined. If there's a technical reason why those few lines don't solve the problem today, I'd like to understand better why.

There's a reason why it "feels hacky" to mutate unrelated state to create an instance of a Regex. That said, if this is a short-term workaround for a longer-term solution, then it's fine. We've done similar thing with the ExecutionContext flow for APIs that don't support disabling it.

@mitchdenny create a similar helper for Regex with a giant comment linking to why it works and linking to the linked issues in this issue.

@stephentoub
Copy link
Member

This also only works for constants, so it's a solution but not one for all cases.

In my experience and investigations in this area, it's the 95% case. The dynamic cases are typically:

  • Tests. Binary size concerns typically don't exist.
  • Someone is accepting the pattern and options to use via public API. These cases should instead accept a Regex instance that let's the caller use source generation. And for cases where that's not possible, callers are often specifying options that involve execution strategy (e.g. Compiled), so even with CreateXx methods, an implementation wanting to support the same capabilities would still end up rooting the various code paths unless it also propagated fine-grained APIs that exposed every such strategy.
  • A file of regexes loaded dynamically. These should also be migrated to the source generator. On top of that, such cases often benefit from NonBacktracking, as the authoring of the regex is frequently not as reviewed/vetted as one directly in source.
  • Unnecessarily using Regex, e.g. string/span methods used directly would be better.

Obviously there are legit cases beyond that, but they're the long tail. They typically use Regex.Escape, and searching around you can see that the number of uses of Regex.Escape pales in comparison to the use of Regex. Not using Escape with dynamic input is a good indicator of a ReDoS waiting to happen if NonBacktracking isn't used.

the only reason it works is because it's an async local otherwise it wouldn't just be hacky, it would be buggy (being static mutable state).

The use in this case has little to do with it being an "async" local. It's to do with it using thread-local state. That is the design of CultureInfo... whether we like it or not, it's based on ambient data. Someone who wants to pass in a culture to lots of scoped code sets the culture, just as ASP.NET currently does for request localization. APIs like Regex that want to participate in this read the current culture. It's not a "hack" to use this mechanism with regex; whether we like it or not, that's how the mechanism was designed and regex uses that mechanism. It's certainly not ideal that using RegexOptions.InvariantCulture roots everything related to options, but using current culture to work around that is no more a hack than adding a new ctor accepting culture to work around that.

We can debate whether 20 years ago the right calls were made as to how CultureInfo works, how culture is passed into tons of APIs like Regex, and even whether Regex should use culture at all. At this point, though, such a mechanism exists to solve the core problem cited in this issue. Arguing for a different design for this ASP.NET issue is no longer in defense of trimming/size.

@davidfowl
Copy link
Member

Setting the culture in the middleware whose job is to set the culture is different to setting to cause a side effect in the regex constructor because we don't have a better way of doing it.

Ambient data shouldn't be abused this way and it is absolutely a hack but we'll hack it for now.

@stephentoub
Copy link
Member

I'm not going to continue the debate. You have a solution.

@davidfowl
Copy link
Member

We’ll open a new issue with an API to accomplish this after we apply the workaround

@eerhardt
Copy link
Member Author

Instead of:

var r = new Regex("...", RegexOptions.IgnoreCase | RegexOptions.InvariantCulture);

could ASP.NET just do:

Regex r;
CultureInfo old = CultureInfo.CurrentCulture;
CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
try
{
    r = new Regex("(?i)(?:...)");
}
finally
{
    CultureInfo.CurrentCulture = old;
}

?

We can't because of 2 reasons:

  1. We also need to specify a match timeout. There is no API that takes a Timeout, but doesn't take RegexOptions.

Constraint = new Regex(
regexPattern,
RegexOptions.CultureInvariant | RegexOptions.IgnoreCase,
RegexMatchTimeout);

  1. We also should be using Compiled, but aren't today. I opened RegexRouteConstraint should use Compiled Regex #46154 for this. (Although this could easily be worked around by checking RuntimeFeature.IsDynamicCodeSupported and not passing Compiled when it isn't supported.)

@eerhardt
Copy link
Member Author

I do not think removing the feature is an acceptable solution here. It's been there since the times of ASP.NET and is a well-known and used feature, removing it by default for AoT does not seem like the right trade-off.

@javiercn - I've updated Option (3) above to better explain the current proposal. We would only remove the feature by default when the app uses the new "slim" hosting API. And they can explicitly opt back into it, if they need it.

@stephentoub
Copy link
Member

We also need to specify a match timeout. There is no API that takes a Timeout, but doesn't take RegexOptions.

That's legitimate, though also suggests again that NonBacktracking is desirable... the only reason timeout exists is to protect against runaway backtracking.

@mitchdenny
Copy link
Member

OK so just caught up on this thread that got plenty of traction overnight ;) It seems like there are two points of view and we need to settle on a path forward:

  1. NonBacktracking is desirable and the ~1MB cost is an acceptable trade off
  2. NonBacktracking isn't needed in our scenario and we would rather save the ~1MB

I think if we can settle where we stand on that then we can talk about the practical solutions. If we decide on position 2 above we will need at least a modest API change so we can set a timeout. There are a few different approaches we could take there - but lets decide on a high level between 1 and 2 first.

@captainsafia captainsafia moved this from Committed to Todo in [.NET 8] Web Frameworks Jan 30, 2023
@mitchdenny mitchdenny moved this from Todo to In Progress in [.NET 8] Web Frameworks Jan 31, 2023
@eerhardt
Copy link
Member Author

@mitchdenny - unfortunately #46227 didn't fully address this size issue. RegexRouteConstraint still gets pulled into this app with this path:

image

See this code:

public static RoutePatternParameterPolicyReference Constraint(object constraint)
{
// Similar to RouteConstraintBuilder
if (constraint is IRouteConstraint policy)
{
return ParameterPolicyCore(policy);
}
else if (constraint is string content)
{
return ParameterPolicyCore(new RegexRouteConstraint("^(" + content + ")$"));
}
else
{
throw new InvalidOperationException(Resources.FormatRoutePattern_InvalidConstraintReference(
constraint ?? "null",
typeof(IRouteConstraint)));
}
}

@JamesNK
Copy link
Member

JamesNK commented Feb 18, 2023

Damn. Is this definitely the last reference?

RoutePatternFactory.Parse is a static API with no integration with DI or configuration. I think achieving the trimming goal requires a breaking change.

The RegexRouteConstraint created here could be replaced by a new RegexPlaceholderRouteConstraint. The regex placeholder constraint just captures the pattern as a string. Then, when routing builds the route table, the placeholder route constraint is replaced by looking up regex from the route constraint table and passing the placeholder's string as an argument. If a regex constraint is registered, then it is used. If the erroring constraint is resolved, then the standard error is thrown.

The breaking change is that calling RegexPlaceholderRouteConstraint.Match throws an error. It can't reference Regex so it can't properly evaluate the route. We can provide a good error message that explains how to work around the change.

The vast majority of people don't use a RoutePattern directly. They create it and register it with the app, and allow the app to figure out the right way to use it. Maybe do an API search on GitHub for people getting parameter policies from a RoutePattern?

@JamesNK
Copy link
Member

JamesNK commented Feb 18, 2023

Idea to make the change less breaking!

IRouteConstraint.Match has an optional HttpContext argument. If RegexPlaceholderRouteConstraint.Match is called, we could resolve route options from the context's request services and then look up the route constraint table with regex. The placeholder constraint can then cache the real regex constraint and use it in Match.

This isn't a 100% fix. HttpContext is optional. Matching an incoming HTTP request will always have a HttpContext, but it's possible to do link generation outside of an HTTP request, and there are overloads on the link generator where the context is optional.

@davidfowl
Copy link
Member

Or maybe we make the breaking change in AOT mode only?

@mitchdenny
Copy link
Member

How would that work - use a runtime flag?

@JamesNK
Copy link
Member

JamesNK commented Feb 20, 2023

if (!RuntimeFeature.IsDynamicCodeSupported)
{
    return new RegexRouteConstraint(pattern);
}
else
{
    return new RegexPlaceholderRouteConstraint(pattern);
}

Plus, what I mentioned about resolving regex from constraint map when building route table.

@mitchdenny
Copy link
Member

One of the things we did with the AddRouting/AddRoutingCore combo was make it possible for folks to opt back into using regex route constraints if they wanted to. We wouldn't necessarily want to block folks from using Regex in their routes completely - just by default.

So how could we opt them back in. Some kind of static Boolean that is evaluated as part of that condition...

@JamesNK
Copy link
Member

JamesNK commented Feb 20, 2023

We'd combine that check with what I described here: #46142 (comment)

Most people are unaffected. The only people impacted are those doing AOT, inferring regex constraints from strings, and using routes outside the ASP.NET Core route table.

A very small percentage of people might need to update their app to be AOT friendly, but we can throw a descriptive error message from RegexPlaceholderRouteConstraint.Match telling them what they need to change.

We could add a new build switch to opt-in/opt-out here, but I think the number of people impacted is so small that it wouldn't be worth it. And if more people run into trouble than anticipated, we could add a new switch later based on customer feedback.

@mitchdenny
Copy link
Member

OK I'll have a go at getting this working to see what it looks like.

@captainsafia
Copy link
Member

Triage: We discussed not changing the behavior based on IsDynamicCodeSupported so that behavior is uniform between AOT and non-AOT scenarios. However, we might not have the information needed to configure the behavior to be the same when the RegexConstraint is initialized.

We observed that we could make the modification in the RoutePatternFactory.PatternCore method to factor out bring in the regex constraints when the parameter policies (which introduce the problematic codepath of bringing in Regex patterns) out.

@eerhardt
Copy link
Member Author

The Regex engine still isn't being trimmed with the latest code. This is due to the refactoring that was done in #46323.

The reason is because the new Regex code has been moved out of the constructor, and instead is lazily instantiated in the Constraint property:

public Regex Constraint
{
get
{
// Create regex instance lazily to avoid compiling regexes at app startup. Delay creation until constraint is first evaluated.
// This is not thread safe. No side effect but multiple instances of a regex instance could be created from a burst of requests.
_constraint ??= new Regex(
_regexPattern,
RegexOptions.CultureInvariant | RegexOptions.Compiled | RegexOptions.IgnoreCase,
RegexMatchTimeout);
return _constraint;
}
}

And that property can't be trimmed because it is used in the Match and MatchesLiteral methods. And the whole class can't be trimmed because it is still used by AlphaRouteConstraint.

@eerhardt eerhardt reopened this Feb 22, 2023
@captainsafia captainsafia moved this from Done to Todo in [.NET 8] Web Frameworks Feb 22, 2023
eerhardt added a commit to eerhardt/aspnetcore that referenced this issue Feb 22, 2023
Move the lazy Regex creation code to a delegate that is only set with the RegexRouteConstraint constructor that takes a string regexPattern. This allows for the Regex engine to be trimmed when the regexPattern constructor is trimmed.

Fix dotnet#46142
eerhardt added a commit that referenced this issue Feb 23, 2023
* Allow Regex engine to be trimmed

Move the lazy Regex creation code to a delegate that is only set with the RegexRouteConstraint constructor that takes a string regexPattern. This allows for the Regex engine to be trimmed when the regexPattern constructor is trimmed.

Fix #46142
@captainsafia captainsafia moved this from Todo to Done in [.NET 8] Web Frameworks Feb 23, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Mar 25, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-web-frameworks *DEPRECATED* This label is deprecated in favor of the area-mvc and area-minimal labels feature-routing
Projects
No open projects
Status: Done
9 participants