-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API request: string.Intern(ReadOnlySpan<char> ...) #28368
Comments
How are these supposed to work without allocating a string? You mean if a matching string is already in the intern pool? |
@khellang But today we have to do this: Token ProcessToken(Span<char> token)
{
var temp = new String(token);
return new Token {
Value = string.Intern(temp);
}
}
This would be nicer: Token ProcessToken(Span<char> token)
{
return new Token {
Value = string.Intern(token);
}
} |
Note that You typically get much better performance results by using a local interning table that does imperfect interning (i.e. it is ok for it to create duplicate strings in rare cases - hash collisions, multiple threads competing to create the same string, etc.) We can still add this API, but it is unlikely to be very useful to actually make things faster. |
Wouldn't we always recommend against |
@jkotas @danmosemsft now that the C# compiler will be getting module initializers, and it already has support for embedding byte arrays in PE file can we please revisit this? With module initializers there becomes a more defined place (even if you consider it as a convention) to do this work. This would be very useful to the Bing team. |
I do not see how module initializers change the equation on this. Could you please provide details for how you would like to use this API, and what do you expect to save by doing that? |
We have a ton of string data (not strings, but UTF-16 bytes in our PE files). We would use module initializers to access this data in the file and call string.Intern on them. |
Ok, I have been thinking about this and I see situations where this may be useful. |
A big concern of mine is that folks might see a spanified version of ReadOnlySpan<char> requestLine = GetRequestLine(); // = "GET /path HTTP/1.1" (as a span)
ReadOnlySpan<char> httpVerb = requestLine[..requestLine.IndexOf(' ')]; // = "GET" (as a span)
string verbAsInternedString = string.Intern(httpVerb); // = "GET" (as a string) Basically, the sample above shows an "allocation-free" way of extracting the literal strings "GET" / "POST" / etc. from the request line, but it introduces a security flaw in the application. This allows the request payload to send arbitrary nonsensical verbs to the application, and those verbs will get added to a static singleton table which is never cleared. Eventually this will result in OOMing the web server process: a denial of service attack. To discourage such use, I would instead recommend the following API: public sealed class string
{
public static bool TryGetInternedString(ReadOnlySpan<char> str, [NotNullWhen(true)] out string? internedString);
} This allows you to query whether the provided buffer exists in the intern table using a familiar try pattern. If the buffer does not exist in the intern table, it outs null. If you want to add the buffer to the intern table, you're forced to bounce through |
@GrabYourPitchforks Good catch as always.
The pit of failure with API like this is that it depends on the global intern table being populated with the string you care about. The population of the global intern table is not very predictable. For example, the behavior of this method would depend on how the JIT inliner works (inlining happens to populate the global intern table as side-effect). The existing This suggests that we should go back to the drawing board with this API. |
This isn't quite my area of expertise. But doesn't this mean that if somebody calls |
Rather than integrating this with string.Intern, at least the scenarios I'm aware of here would be addressed if we had a |
I'm not clear why we would want to double down on String.Intern, which was sort of recommended against as long as I can remember for reasons mentioned above and one of those is the one Levi points to. What might be worth doing is offering a first class interning interface that is pluggable. This was discussed here already: |
The scenario where @mjsabby 's wanted to use this API was a static initialization of large configuration data. It does not have the problem with denial-of-service and there are benefits from interning the program string constants and the configuration data keys against the same pool. It may be too niche case to warrant adding the API. |
I am not convinced we would want to create a new API for this case. The scenarios do see to be in niche cases, and there are risks of misuse that have been called out. @mjsabby, @ladeak, and @AlgorithmsAreCool -- I am inclined to close this issue. Please let me know if you have strong objections to closing it. |
I say close it. I've stopped using trying to use string.intern. I almost think it should be marked Obsolete. I'm thinking about opening a separate issue for a better string caching API in general. |
Since this was closed; I wrote a package called public class InternPool
{
[return: NotNullIfNotNull("value")]
public string? Intern(string? value);
public string InternAscii(ReadOnlySpan<byte> asciiValue);
public string InternUtf8(ReadOnlySpan<byte> utf8Value);
public string Intern(ReadOnlySpan<char> value);
} Isn't thread safe so you'll have to handle your own locking; and haven't written tests yet so don't know if it works 😅; so YMMV |
Isn't that what ... I'll show myself out... |
Added a test, fixed some bugs, bumped to 0.0.2 |
String.Intern currently accepts a string parameter. When processing a
ReadOnlySpan<char>
it would be nice to be able to intern a string without a necessary string allocation upfront. This would be especially useful for strings that would be used frequently.For this reason I would request the following overloads added to string:
The text was updated successfully, but these errors were encountered: