Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse integers in a string based on an index #16933

Closed
hughbe opened this issue Apr 8, 2016 · 15 comments
Closed

Parse integers in a string based on an index #16933

hughbe opened this issue Apr 8, 2016 · 15 comments
Assignees
Milestone

Comments

@hughbe
Copy link
Contributor

hughbe commented Apr 8, 2016

Summary

Currently parsing integers takes a string, and parses the entirety of the string. I propose adding APIs to parse a string from an index, with an additional overload specifying the length.

Rationale

In order to parse part of a string, we need to perform a string.Substring(...) method on the string. This results in allocations, affecting performance.

Proposed APIs - with slices

public int Parse(ReadOnlySlice<char> s)
public int Parse(ReadOnlySlice<char> s, IFormatProvider format);

public bool TryParse(ReadOnlySlice<char> s, out int i);
public bool TryParse(ReadOnlySlice<char> s, IFormatProvider format, out int i);

Proposed APIs - without slices

public class Int32
{
    // Proposed new APIs
    public int Parse(string s, int index);

    // Either
     public int Parse(string s, int index, int count);
     // or
     Public int Parse(string s, int index, int length);
     // or
     public int Parse(string s, int startIndex, int endIndex);

    // Existing parsing APIs
    public int Parse(string);
    ...
}

Discussion

  • These overloads should be intelligent and not allocate any substrings
  • I assume TryParse overloads would also have to be added
  • Would we need to add overloads taking an IFormatProvider too
  • How common is this use case?
  • Is this worth adding 4 public methods to each integer type?
  • Should we bother with the Parse(string, int) overload? I think most people calling his API would know the required length.

Example

I added some custom parsing code in a PR in coreclr (dotnet/coreclr#3992). This is basically the code that these proposed methods would use.
Parsing performance of ints in Version increased by up to 4x due to reduced substring allocations.

Another example is Guid (dotnet/coreclr#3965) where performance of parsing Guid strings increased by up to 3x with the allocation count reducing to 0.

Alternatives

  • Add a Parse(char* s, int count) method instead but this is rare
@svick
Copy link
Contributor

svick commented Apr 8, 2016

I would wait to see what happens with slices (dotnet/roslyn#120, experimental implementation in corefxlabs). With slices, the new overload could be just:

public int Parse(ReadOnlySlice<char> s);

@hughbe
Copy link
Contributor Author

hughbe commented Apr 8, 2016

Oh nice one I haven't heard of those!
My only issue is - is this a C# only feature? Remember, non-C# users consume .NET core APIs too

@svick
Copy link
Contributor

svick commented Apr 8, 2016

I think special syntax (possibly something like T[:]) would be a C# feature, but the rest would be usable from any .Net language.

@hughbe
Copy link
Contributor Author

hughbe commented Apr 8, 2016

Ah okay thanks for the heads up! I'll update the PR 59 propose the ReadOnlySlice addition

@bartdesmet
Copy link
Contributor

+1 for slices.

FWIW, we built a complete StringSegment struct containing a string, start index, and length, over here to avoid such allocations. The context of this work was for a fast lexer and parser that highly reduced allocations that'd otherwise be incurred for an intermediate representation that gets dropped on the floor "quite soon" by subsequent phases that build the final result (where "quite soon" is not soon enough to remain Gen0 garbage but late enough to let them age).

The nice thing is that many operations can be lifted over these without incurring allocations, e.g. Substring, Compare, Equals, StartsWith, IndexOf, Trim, etc. Operations like Split can also be built efficiently to produce an array of segments and operations like Concat can merge adjacent segments.

More advanced operations such as "rebasing" can be built too, where an array of string segments is passed to a function that analyzes the spans, trims the underlying strings into substrings that have a high coverage by segments on top (say an 80% threshold), and substitutes the segments to refer to these newly allocated smaller strings. This is very useful in text extraction procedures where a bunch of intermediate operations perform parsing, substring operations, trim operations, etc. and ultimately want to keep the results alive but a) without keeping the whole original document alive, and b) without incurring a lot of intermediate allocations.

Where it falls down is indeed in the lack of overloads to Parse and TryParse functions in various places. Another drawback could be that these StringSegment values wouldn't work naturally on interop boundaries (the underlying char* to the first character could be used, but the segment wouldn't be \0 terminated, so it'd need to pass the length as well).

@natalie-o-perret
Copy link

@svick
+1 for slices syntactic sugar even though it reminds me Python =]

@hughbe
Copy link
Contributor Author

hughbe commented Apr 9, 2016

@bartdesmet
Thanks for that - seems like slices would be ideal here.
Obviously the .NET team is working on releasing core 1.0, so this is probably on the back burner right? without pressurising, are there plans to include such APIs soon or will this remain as an experiment for a whole

@danmoseley
Copy link
Member

@KrzysztofCwalina can comment on plans for slices in CoreFX proper.

@hughbe
Copy link
Contributor Author

hughbe commented Oct 28, 2016

It would be pretty cool if we could call APIs taking string with Slice<T> without any additional substrings or allocations - i.e map string[0] to slice[0].
Not sure if this will be possible or not, but it would certainly avoid the need to add a number of APIs!

@thomaslevesque
Copy link
Member

Related: #14802

@danmoseley
Copy link
Member

@terrajobst @KrzysztofCwalina do we want to contemplate "regular" API for this, or defer until slices?

@KrzysztofCwalina
Copy link
Member

I would do regular (string, int, int) APIs in corfx. We will do Span APIs as an OOB package.

@danmoseley
Copy link
Member

OK. @alexperovich @joperezr please either give feedback or mark as api-ready-for-review.

@hughbe looks like you have this just for ints, do you believe that's sufficient to get almost all the benefit, or are you proposing to add it to all the other primitive types?

@joperezr
Copy link
Member

API looks good to me, so I'll mark as ready for review and we'll check it on triage.

@terrajobst
Copy link
Member

This is an area that we're actively working on in CoreFxLab (see roadmap). For details on how we think about parsing, take a look at this speclet.

Given that everything around spans is experimental, it's too early to do regular API reviews against CoreFx, that's why I'm closing it here.

However, @hughbe, I'd love to continue the discussion over in CoreFxLab! 😄

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.0.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

10 participants