Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create class for reading Json files in chunks #5530

Conversation

jgonz120
Copy link
Contributor

@jgonz120 jgonz120 commented Dec 5, 2023

Bug

Fixes:

Regression? Last working version:

Description

These classes and structs serve as the base for parsing json objects in memory chunks. I plan on deleting all unnecessary commented code in the final pr.

Utf8JsonStreamReader

This struct is a wrapper of the Utf8JsonReader. It maintains the array buffer which stores the current memory chunk being parsed. It wraps the read and write functions from Utf8JsonReader to determine if we've reached the end of the current stream and move the buffer down.

Utf8JsonStreamReaderConverter

This is used to facilitate parsing object arrays. STJ uses a struct for the data we are reading, which means we cannot use anonymous functions or delegates anymore. This class will be used more when I create the PR for parsing the asset files.
 

PR Checklist

  • PR has a meaningful title

  • PR has a linked issue.

  • Described changes

  • Tests

    • Automated tests added
    • OR
    • Test exception
    • OR
    • N/A
  • Documentation

    • Documentation PR or issue filled
    • OR
    • N/A

@jgonz120 jgonz120 requested a review from a team as a code owner December 5, 2023 21:03
Copy link
Member

@zivkan zivkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about the number of comments I'm adding. I hope they're not too annoying/opinionated. I do have some comments in this round about possible correctness issues that I'd like to see addressed before I'd feel comfortable signing off on this PR.

src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
Copy link
Member

@zivkan zivkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for making all the improvements!

src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs Outdated Show resolved Hide resolved
@jgonz120 jgonz120 merged commit 22f3566 into dev-jgonz120-FeatureBranch-Stj-Migration Dec 12, 2023
1 check passed
@jgonz120 jgonz120 deleted the dev-jgonz120-create-init-stj-files branch December 12, 2023 18:30
jgonz120 added a commit that referenced this pull request Dec 20, 2023
* Moved files over and addressed some PR comments

* added comment

* switched to true and false strings

* Added ctr to specify buffer for testing purposes.

* remove commented code

* switch to use Utf8 preamble for BOM

* Create method for checking complete

* combined code for ReadStringArray

* Updated buffer size to match STJ's default buffer size

* Switch Utf8JsonStreamReader to be disposable.

* Switch to read the value for numbers into a string directly

* revert back to using private var for utf8Bom

* Remove ReadStringArrayAsList

* Avoid referencing buffer after  returning

* Actually avoid referencing _buffer after returning

* Update how buffers are fed into Utf8JsonReader to avoid feeding extra empty data.

* remove extra line

* Reverted back to using try get int for ReadTokenAsString

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* Remove ValueTextEquals taking in string

* Switched to Skip instead of TrySkip

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* Added some unit tests

* fix Bom

* Switched to using Moq

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* loop through stream when reading to ensure reading full bytes or to the end

* update signature comment

* Switched stream back to field and supress warning

---------

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
Copy link
Contributor

@kartheekp-ms kartheekp-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I began reviewing the #5541, but I realized that I lack sufficient context regarding the STJ API introduced in this PR. Hence, I started reviewing this PR. During this review, I identified several areas for improvement and have added comments to the PR. I apologize for the delay in providing this feedback. Since this PR has already been merged into a feature branch, please consider creating a follow-up PR to address these comments.

do
{
var spaceLeftInBuffer = _buffer.Length - _bufferUsed;
bytesRead = _stream.Read(_buffer, _bufferUsed, spaceLeftInBuffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using the Stream.ReadAsync Method instead of the blocking Read method? It would be great if we could use the overload that accepts a CancellationToken.

Docs state that for an asynchronous version of example, see .NET samples JSON project.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have links handy, but my understanding is that on .NET Framework at least, files without passing the async flag to the operating system, is that there's a non-trivial perf impact to reading the files with async APIs. Similarly, files opened (from the OS point of view) with the async flag, but then using blocking APIs has a non-trivial perf impact. Also, .NET Framework's async scheduling just isn't nearly as good as .NET 6+'s, so even when async file IO is done consistently, when the delays are low (OS has file cached in memory?) then there's still going to be a measurable perf impact.

If this suggestion, using Stream.ReadAsync, is being seriously considered, I'd recommend reaching out to the VS Perf team and checking if my believe about async file IO on .NET Framework is wrong.

Copy link
Contributor

@davkean davkean Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Async isn't about making things faster; its about yielding thread pool threads while its blocking, which lets us make sure we're efficiently using the CPU by having only CPU work on them.

NuGet during restore spends a lot of time blocking thread pools threads (see Thread Pool watson, its #13, #14, #19, #22 in 17.8), so async where possible should be the approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @davkean good to know. It's a shame that most of the thread pool watson buckets are caused by .NET & Windows APIs where there are no async alternatives. But we should make time to investigate async file reads (where APIs exist).

_bufferUsed += bytesRead;
}
while (bytesRead != 0 && _bufferUsed != _buffer.Length);
_reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, jsonReaderState);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://learn.microsoft.com/dotnet/standard/serialization/system-text-json/use-utf8jsonreader doc suggests to use _reader.CurrentState. If this suggestion is valid for our scenario, then we can remove the jsonReaderState parameter from the method definition.

Suggested change
_reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, jsonReaderState);
_reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, _reader.CurrentState);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is taking in the jsonReaderState because in the constructor the state doesn't already exist.

bytesRead = _stream.Read(_buffer, _bufferUsed, spaceLeftInBuffer);
_bufferUsed += bytesRead;
}
while (bytesRead != 0 && _bufferUsed != _buffer.Length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about the need for the do-while loop inside the ReadStreamIntoBuffer method, especially considering that ReadStreamIntoBuffer is already invoked within a while loop in the Read method. If the buffer is smaller than the stream's content, or if the buffer has more capacity than what's currently available in the stream, the first call to _stream.Read in ReadStreamIntoBuffer should be enough to fill the buffer up to its capacity or the stream's current availability. Is the do-while loop necessary in this case, or could it lead to redundant read operations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it in a loop because of Andy's comment here: #5530 (comment).

There isn't any guarantee that the stream is actually read to the limit requested. So the loop is necessary to ensure we are reading until the buffer is full.

/// This struct is used to read over a memeory stream in parts, in order to avoid reading the entire stream into memory.
/// It functions as a wrapper around <see cref="Utf8JsonStreamReader"/>, while maintaining a stream and a buffer to read from.
/// </summary>
internal ref struct Utf8JsonStreamReader
Copy link
Contributor

@kartheekp-ms kartheekp-ms Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Utf8JsonStreamReader struct appears to have dual responsibilities violating single responsibility principle.

Stream Wrapper: The struct acts as a wrapper around a stream, specifically for the purpose of reading it in chunks rather than loading the entire stream into memory.

Utf8JsonReader Wrapper: The struct also wraps the functionality of Utf8JsonReader. This includes functionalities like reading various data types (strings, integers, booleans) from JSON, handling different JSON token types, and managing the state of the JSON reader.

How about splitting it into 2 structs, one handles stream buffer, resizing the buffer etc., and another one wraps UTF8JsonReader methods such as GetString(), GetBoolean() etc.,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a large change if we want to do this I would like to do it in a new PR after the last one is complete.

private ArrayPool<byte> _bufferPool;
private int _bufferUsed = 0;

internal Utf8JsonStreamReader(Stream stream, int bufferSize = BufferSizeDefault, ArrayPool<byte> arrayPool = null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor throws following exception when Array.Empty<byte>() is passed to the memory stream. I am just letting you know incase if you would like to add a test.

 using (var stream = new MemoryStream(Array.Empty<byte>()))
 using (var reader = new Utf8JsonStreamReader(stream))
{
}

System.Text.Json.JsonReaderException : '0x00' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test to demonstrate this behavior but I don't think it's worthwhile adding validation for this.

jgonz120 added a commit that referenced this pull request Jan 27, 2024
* Moved files over and addressed some PR comments

* added comment

* switched to true and false strings

* Added ctr to specify buffer for testing purposes.

* remove commented code

* switch to use Utf8 preamble for BOM

* Create method for checking complete

* combined code for ReadStringArray

* Updated buffer size to match STJ's default buffer size

* Switch Utf8JsonStreamReader to be disposable.

* Switch to read the value for numbers into a string directly

* revert back to using private var for utf8Bom

* Remove ReadStringArrayAsList

* Avoid referencing buffer after  returning

* Actually avoid referencing _buffer after returning

* Update how buffers are fed into Utf8JsonReader to avoid feeding extra empty data.

* remove extra line

* Reverted back to using try get int for ReadTokenAsString

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* Remove ValueTextEquals taking in string

* Switched to Skip instead of TrySkip

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* Added some unit tests

* fix Bom

* Switched to using Moq

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* loop through stream when reading to ensure reading full bytes or to the end

* update signature comment

* Switched stream back to field and supress warning

---------

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
jgonz120 added a commit that referenced this pull request Feb 5, 2024
* Moved files over and addressed some PR comments

* added comment

* switched to true and false strings

* Added ctr to specify buffer for testing purposes.

* remove commented code

* switch to use Utf8 preamble for BOM

* Create method for checking complete

* combined code for ReadStringArray

* Updated buffer size to match STJ's default buffer size

* Switch Utf8JsonStreamReader to be disposable.

* Switch to read the value for numbers into a string directly

* revert back to using private var for utf8Bom

* Remove ReadStringArrayAsList

* Avoid referencing buffer after  returning

* Actually avoid referencing _buffer after returning

* Update how buffers are fed into Utf8JsonReader to avoid feeding extra empty data.

* remove extra line

* Reverted back to using try get int for ReadTokenAsString

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* Remove ValueTextEquals taking in string

* Switched to Skip instead of TrySkip

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* Added some unit tests

* fix Bom

* Switched to using Moq

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* loop through stream when reading to ensure reading full bytes or to the end

* update signature comment

* Switched stream back to field and supress warning

---------

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
jgonz120 added a commit that referenced this pull request Feb 5, 2024
* Moved files over and addressed some PR comments

* added comment

* switched to true and false strings

* Added ctr to specify buffer for testing purposes.

* remove commented code

* switch to use Utf8 preamble for BOM

* Create method for checking complete

* combined code for ReadStringArray

* Updated buffer size to match STJ's default buffer size

* Switch Utf8JsonStreamReader to be disposable.

* Switch to read the value for numbers into a string directly

* revert back to using private var for utf8Bom

* Remove ReadStringArrayAsList

* Avoid referencing buffer after  returning

* Actually avoid referencing _buffer after returning

* Update how buffers are fed into Utf8JsonReader to avoid feeding extra empty data.

* remove extra line

* Reverted back to using try get int for ReadTokenAsString

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* Remove ValueTextEquals taking in string

* Switched to Skip instead of TrySkip

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* Added some unit tests

* fix Bom

* Switched to using Moq

* Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>

* loop through stream when reading to ensure reading full bytes or to the end

* update signature comment

* Switched stream back to field and supress warning

---------

Co-authored-by: Andy Zivkovic <zivkan@users.noreply.github.com>
jgonz120 added a commit that referenced this pull request Feb 12, 2024
Crated a new struct to parse the assets file using System.Text.Json instead of Newtonsoft. It will read the file without loading it completely in memory, reducing memory allocations. 

PRs associated with this:
* Create class for reading Json files in chunks (#5530)
* System.Text.Json Migration - Adding code to parse the PackageSpec using STJ (#5541)
* System.Text.Json Migration - Adding code to parse the Project.Assets.Json file using STJ.  (#5558)
* Create class for reading Json files in chunks (#5530)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants