Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create class for reading Json files in chunks #5530
Create class for reading Json files in chunks #5530
Changes from all commits
d1e4d8a
9ec0869
cfa2169
c1753f9
cc2ad30
32ec713
a9940e9
6f87583
0f75860
5c4269a
c469899
d0f9f5e
d3e6ab8
2d7cba8
4192d9b
f67239b
ca6e1d7
fa9639d
997f199
3e4146c
b403ed8
a1c4844
4ff0f7e
a9884ec
7a467d5
86d3524
c559e69
74b2e54
0c05eb8
a233b40
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Utf8JsonStreamReader
struct appears to have dual responsibilities violating single responsibility principle.Stream Wrapper: The struct acts as a wrapper around a stream, specifically for the purpose of reading it in chunks rather than loading the entire stream into memory.
Utf8JsonReader Wrapper: The struct also wraps the functionality of Utf8JsonReader. This includes functionalities like reading various data types (strings, integers, booleans) from JSON, handling different JSON token types, and managing the state of the JSON reader.
How about splitting it into 2 structs, one handles stream buffer, resizing the buffer etc., and another one wraps UTF8JsonReader methods such as GetString(), GetBoolean() etc.,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a large change if we want to do this I would like to do it in a new PR after the last one is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor throws following exception when
Array.Empty<byte>()
is passed to the memory stream. I am just letting you know incase if you would like to add a test.System.Text.Json.JsonReaderException : '0x00' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test to demonstrate this behavior but I don't think it's worthwhile adding validation for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases, which I currently don't understand, the STJ implementation raises an exception when
IsFinalBlock
is set to true. Therefore, it would be better to check this value first before invoking the underlyingRead
method. Another advantage is that readingIsFinalBlock
is an O(1) operation.https://github.com/dotnet/runtime/blob/3a5bea5d60ea04b897ac968a358ca99a1189d368/src/libraries/System.Text.Json/src/System/Text/Json/Reader/Utf8JsonReader.cs#L269-L289
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we check for the final block first then we won't all the read at all. The exception you're seeing there means that the Utf8JsonReader was told it has all the data but the current property is none. None meaning that there is no JSON data in the reader. If we want to check for the scenario we can, I would think it should be in the constructor though not here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using the Stream.ReadAsync Method instead of the blocking Read method? It would be great if we could use the overload that accepts a
CancellationToken
.Docs state that for an asynchronous version of example, see .NET samples JSON project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have links handy, but my understanding is that on .NET Framework at least, files without passing the async flag to the operating system, is that there's a non-trivial perf impact to reading the files with async APIs. Similarly, files opened (from the OS point of view) with the async flag, but then using blocking APIs has a non-trivial perf impact. Also, .NET Framework's async scheduling just isn't nearly as good as .NET 6+'s, so even when async file IO is done consistently, when the delays are low (OS has file cached in memory?) then there's still going to be a measurable perf impact.
If this suggestion, using Stream.ReadAsync, is being seriously considered, I'd recommend reaching out to the VS Perf team and checking if my believe about async file IO on .NET Framework is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Async isn't about making things faster; its about yielding thread pool threads while its blocking, which lets us make sure we're efficiently using the CPU by having only CPU work on them.
NuGet during restore spends a lot of time blocking thread pools threads (see Thread Pool watson, its #13, #14, #19, #22 in 17.8), so async where possible should be the approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @davkean good to know. It's a shame that most of the thread pool watson buckets are caused by .NET & Windows APIs where there are no async alternatives. But we should make time to investigate async file reads (where APIs exist).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about the need for the do-while loop inside the
ReadStreamIntoBuffer
method, especially considering thatReadStreamIntoBuffer
is already invoked within a while loop in theRead
method. If the buffer is smaller than the stream's content, or if the buffer has more capacity than what's currently available in the stream, the first call to_stream.Read
inReadStreamIntoBuffer
should be enough to fill the buffer up to its capacity or the stream's current availability. Is the do-while loop necessary in this case, or could it lead to redundant read operations?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it in a loop because of Andy's comment here: #5530 (comment).
There isn't any guarantee that the stream is actually read to the limit requested. So the loop is necessary to ensure we are reading until the buffer is full.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://learn.microsoft.com/dotnet/standard/serialization/system-text-json/use-utf8jsonreader doc suggests to use
_reader.CurrentState
. If this suggestion is valid for our scenario, then we can remove thejsonReaderState
parameter from the method definition.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is taking in the jsonReaderState because in the constructor the state doesn't already exist.