-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create class for reading Json files in chunks #5530
Create class for reading Json files in chunks #5530
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry about the number of comments I'm adding. I hope they're not too annoying/opinionated. I do have some comments in this round about possible correctness issues that I'd like to see addressed before I'd feel comfortable signing off on this PR.
Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for making all the improvements!
22f3566
into
dev-jgonz120-FeatureBranch-Stj-Migration
* Moved files over and addressed some PR comments * added comment * switched to true and false strings * Added ctr to specify buffer for testing purposes. * remove commented code * switch to use Utf8 preamble for BOM * Create method for checking complete * combined code for ReadStringArray * Updated buffer size to match STJ's default buffer size * Switch Utf8JsonStreamReader to be disposable. * Switch to read the value for numbers into a string directly * revert back to using private var for utf8Bom * Remove ReadStringArrayAsList * Avoid referencing buffer after returning * Actually avoid referencing _buffer after returning * Update how buffers are fed into Utf8JsonReader to avoid feeding extra empty data. * remove extra line * Reverted back to using try get int for ReadTokenAsString * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * Remove ValueTextEquals taking in string * Switched to Skip instead of TrySkip * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * Added some unit tests * fix Bom * Switched to using Moq * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * loop through stream when reading to ensure reading full bytes or to the end * update signature comment * Switched stream back to field and supress warning --------- Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I began reviewing the #5541, but I realized that I lack sufficient context regarding the STJ API introduced in this PR. Hence, I started reviewing this PR. During this review, I identified several areas for improvement and have added comments to the PR. I apologize for the delay in providing this feedback. Since this PR has already been merged into a feature branch, please consider creating a follow-up PR to address these comments.
do | ||
{ | ||
var spaceLeftInBuffer = _buffer.Length - _bufferUsed; | ||
bytesRead = _stream.Read(_buffer, _bufferUsed, spaceLeftInBuffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using the Stream.ReadAsync Method instead of the blocking Read method? It would be great if we could use the overload that accepts a CancellationToken
.
Docs state that for an asynchronous version of example, see .NET samples JSON project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have links handy, but my understanding is that on .NET Framework at least, files without passing the async flag to the operating system, is that there's a non-trivial perf impact to reading the files with async APIs. Similarly, files opened (from the OS point of view) with the async flag, but then using blocking APIs has a non-trivial perf impact. Also, .NET Framework's async scheduling just isn't nearly as good as .NET 6+'s, so even when async file IO is done consistently, when the delays are low (OS has file cached in memory?) then there's still going to be a measurable perf impact.
If this suggestion, using Stream.ReadAsync, is being seriously considered, I'd recommend reaching out to the VS Perf team and checking if my believe about async file IO on .NET Framework is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Async isn't about making things faster; its about yielding thread pool threads while its blocking, which lets us make sure we're efficiently using the CPU by having only CPU work on them.
NuGet during restore spends a lot of time blocking thread pools threads (see Thread Pool watson, its #13, #14, #19, #22 in 17.8), so async where possible should be the approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @davkean good to know. It's a shame that most of the thread pool watson buckets are caused by .NET & Windows APIs where there are no async alternatives. But we should make time to investigate async file reads (where APIs exist).
_bufferUsed += bytesRead; | ||
} | ||
while (bytesRead != 0 && _bufferUsed != _buffer.Length); | ||
_reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, jsonReaderState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://learn.microsoft.com/dotnet/standard/serialization/system-text-json/use-utf8jsonreader doc suggests to use _reader.CurrentState
. If this suggestion is valid for our scenario, then we can remove the jsonReaderState
parameter from the method definition.
_reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, jsonReaderState); | |
_reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, _reader.CurrentState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is taking in the jsonReaderState because in the constructor the state doesn't already exist.
bytesRead = _stream.Read(_buffer, _bufferUsed, spaceLeftInBuffer); | ||
_bufferUsed += bytesRead; | ||
} | ||
while (bytesRead != 0 && _bufferUsed != _buffer.Length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about the need for the do-while loop inside the ReadStreamIntoBuffer
method, especially considering that ReadStreamIntoBuffer
is already invoked within a while loop in the Read
method. If the buffer is smaller than the stream's content, or if the buffer has more capacity than what's currently available in the stream, the first call to _stream.Read
in ReadStreamIntoBuffer
should be enough to fill the buffer up to its capacity or the stream's current availability. Is the do-while loop necessary in this case, or could it lead to redundant read operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it in a loop because of Andy's comment here: #5530 (comment).
There isn't any guarantee that the stream is actually read to the limit requested. So the loop is necessary to ensure we are reading until the buffer is full.
test/NuGet.Core.Tests/NuGet.ProjectModel.Test/Utf8JsonStreamReaderTests.cs
Show resolved
Hide resolved
/// This struct is used to read over a memeory stream in parts, in order to avoid reading the entire stream into memory. | ||
/// It functions as a wrapper around <see cref="Utf8JsonStreamReader"/>, while maintaining a stream and a buffer to read from. | ||
/// </summary> | ||
internal ref struct Utf8JsonStreamReader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Utf8JsonStreamReader
struct appears to have dual responsibilities violating single responsibility principle.
Stream Wrapper: The struct acts as a wrapper around a stream, specifically for the purpose of reading it in chunks rather than loading the entire stream into memory.
Utf8JsonReader Wrapper: The struct also wraps the functionality of Utf8JsonReader. This includes functionalities like reading various data types (strings, integers, booleans) from JSON, handling different JSON token types, and managing the state of the JSON reader.
How about splitting it into 2 structs, one handles stream buffer, resizing the buffer etc., and another one wraps UTF8JsonReader methods such as GetString(), GetBoolean() etc.,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a large change if we want to do this I would like to do it in a new PR after the last one is complete.
private ArrayPool<byte> _bufferPool; | ||
private int _bufferUsed = 0; | ||
|
||
internal Utf8JsonStreamReader(Stream stream, int bufferSize = BufferSizeDefault, ArrayPool<byte> arrayPool = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor throws following exception when Array.Empty<byte>()
is passed to the memory stream. I am just letting you know incase if you would like to add a test.
using (var stream = new MemoryStream(Array.Empty<byte>()))
using (var reader = new Utf8JsonStreamReader(stream))
{
}
System.Text.Json.JsonReaderException : '0x00' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test to demonstrate this behavior but I don't think it's worthwhile adding validation for this.
* Moved files over and addressed some PR comments * added comment * switched to true and false strings * Added ctr to specify buffer for testing purposes. * remove commented code * switch to use Utf8 preamble for BOM * Create method for checking complete * combined code for ReadStringArray * Updated buffer size to match STJ's default buffer size * Switch Utf8JsonStreamReader to be disposable. * Switch to read the value for numbers into a string directly * revert back to using private var for utf8Bom * Remove ReadStringArrayAsList * Avoid referencing buffer after returning * Actually avoid referencing _buffer after returning * Update how buffers are fed into Utf8JsonReader to avoid feeding extra empty data. * remove extra line * Reverted back to using try get int for ReadTokenAsString * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * Remove ValueTextEquals taking in string * Switched to Skip instead of TrySkip * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * Added some unit tests * fix Bom * Switched to using Moq * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * loop through stream when reading to ensure reading full bytes or to the end * update signature comment * Switched stream back to field and supress warning --------- Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
* Moved files over and addressed some PR comments * added comment * switched to true and false strings * Added ctr to specify buffer for testing purposes. * remove commented code * switch to use Utf8 preamble for BOM * Create method for checking complete * combined code for ReadStringArray * Updated buffer size to match STJ's default buffer size * Switch Utf8JsonStreamReader to be disposable. * Switch to read the value for numbers into a string directly * revert back to using private var for utf8Bom * Remove ReadStringArrayAsList * Avoid referencing buffer after returning * Actually avoid referencing _buffer after returning * Update how buffers are fed into Utf8JsonReader to avoid feeding extra empty data. * remove extra line * Reverted back to using try get int for ReadTokenAsString * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * Remove ValueTextEquals taking in string * Switched to Skip instead of TrySkip * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * Added some unit tests * fix Bom * Switched to using Moq * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * loop through stream when reading to ensure reading full bytes or to the end * update signature comment * Switched stream back to field and supress warning --------- Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
* Moved files over and addressed some PR comments * added comment * switched to true and false strings * Added ctr to specify buffer for testing purposes. * remove commented code * switch to use Utf8 preamble for BOM * Create method for checking complete * combined code for ReadStringArray * Updated buffer size to match STJ's default buffer size * Switch Utf8JsonStreamReader to be disposable. * Switch to read the value for numbers into a string directly * revert back to using private var for utf8Bom * Remove ReadStringArrayAsList * Avoid referencing buffer after returning * Actually avoid referencing _buffer after returning * Update how buffers are fed into Utf8JsonReader to avoid feeding extra empty data. * remove extra line * Reverted back to using try get int for ReadTokenAsString * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * Remove ValueTextEquals taking in string * Switched to Skip instead of TrySkip * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * Added some unit tests * fix Bom * Switched to using Moq * Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com> * loop through stream when reading to ensure reading full bytes or to the end * update signature comment * Switched stream back to field and supress warning --------- Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
Crated a new struct to parse the assets file using System.Text.Json instead of Newtonsoft. It will read the file without loading it completely in memory, reducing memory allocations. PRs associated with this: * Create class for reading Json files in chunks (#5530) * System.Text.Json Migration - Adding code to parse the PackageSpec using STJ (#5541) * System.Text.Json Migration - Adding code to parse the Project.Assets.Json file using STJ. (#5558) * Create class for reading Json files in chunks (#5530)
Bug
Fixes:
Regression? Last working version:
Description
These classes and structs serve as the base for parsing json objects in memory chunks. I plan on deleting all unnecessary commented code in the final pr.
Utf8JsonStreamReader
This struct is a wrapper of the Utf8JsonReader. It maintains the array buffer which stores the current memory chunk being parsed. It wraps the read and write functions from Utf8JsonReader to determine if we've reached the end of the current stream and move the buffer down.
Utf8JsonStreamReaderConverter
This is used to facilitate parsing object arrays. STJ uses a struct for the data we are reading, which means we cannot use anonymous functions or delegates anymore. This class will be used more when I create the PR for parsing the asset files.
PR Checklist
PR has a meaningful title
PR has a linked issue.
Described changes
Tests
Documentation