Create class for reading Json files in chunks #5530

Merged
30 commits
d1e4d8a
Moved files over and addressed some PR comments
jgonz120 Dec 5, 2023
9ec0869
added comment
jgonz120 Dec 5, 2023
cfa2169
switched to true and false strings
jgonz120 Dec 5, 2023
c1753f9
Added ctr to specify buffer for testing purposes.
jgonz120 Dec 5, 2023
cc2ad30
remove commented code
jgonz120 Dec 6, 2023
32ec713
switch to use Utf8 preamble for BOM
jgonz120 Dec 6, 2023
a9940e9
Create method for checking complete
jgonz120 Dec 6, 2023
6f87583
combined code for ReadStringArray
jgonz120 Dec 7, 2023
0f75860
Updated buffer size to match STJ's default buffer size
jgonz120 Dec 7, 2023
5c4269a
Switch Utf8JsonStreamReader to be disposable.
jgonz120 Dec 7, 2023
c469899
Switch to read the value for numbers into a string directly
jgonz120 Dec 7, 2023
d0f9f5e
revert back to using private var for utf8Bom
jgonz120 Dec 7, 2023
d3e6ab8
Remove ReadStringArrayAsList
jgonz120 Dec 7, 2023
2d7cba8
Avoid referencing buffer after returning
jgonz120 Dec 7, 2023
4192d9b
Actually avoid referencing _buffer after returning
jgonz120 Dec 7, 2023
f67239b
Update how buffers are fed into Utf8JsonReader to avoid feeding extra…
jgonz120 Dec 8, 2023
ca6e1d7
remove extra line
jgonz120 Dec 8, 2023
fa9639d
Reverted back to using try get int for ReadTokenAsString
jgonz120 Dec 11, 2023
997f199
Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs
jgonz120 Dec 11, 2023
3e4146c
Remove ValueTextEquals taking in string
jgonz120 Dec 11, 2023
b403ed8
Switched to Skip instead of TrySkip
jgonz120 Dec 11, 2023
a1c4844
Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs
jgonz120 Dec 11, 2023
4ff0f7e
Added some unit tests
jgonz120 Dec 11, 2023
a9884ec
merge
jgonz120 Dec 11, 2023
7a467d5
fix Bom
jgonz120 Dec 11, 2023
86d3524
Switched to using Moq
jgonz120 Dec 11, 2023
c559e69
Update src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs
jgonz120 Dec 11, 2023
74b2e54
loop through stream when reading to ensure reading full bytes or to t…
jgonz120 Dec 11, 2023
0c05eb8
update signature comment
jgonz120 Dec 12, 2023
a233b40
Switched stream back to field and supress warning
jgonz120 Dec 12, 2023
40 changes: 40 additions & 0 deletions src/NuGet.Core/NuGet.ProjectModel/Utf8JsonReaderExtensions.cs
@@ -0,0 +1,40 @@
// Copyright (c) .NET Foundation. All rights reserved.
// Licensed under the Apache License, Version 2.0. See License.txt in the project root for license information.

using System;
using System.Text.Json;

namespace NuGet.ProjectModel
{
internal static class Utf8JsonReaderExtensions
{
internal static string ReadTokenAsString(this ref Utf8JsonReader reader)
{
switch (reader.TokenType)
{
case JsonTokenType.True:
return bool.TrueString;
case JsonTokenType.False:
return bool.FalseString;
case JsonTokenType.Number:
return reader.ReadNumberAsString();
case JsonTokenType.String:
return reader.GetString();
case JsonTokenType.None:
case JsonTokenType.Null:
return null;
default:
throw new InvalidCastException();
}
}

private static string ReadNumberAsString(this ref Utf8JsonReader reader)
{
if (reader.TryGetInt64(out long value))
{
return value.ToString();
}
return reader.GetDouble().ToString();
}
}
}
273 changes: 273 additions & 0 deletions src/NuGet.Core/NuGet.ProjectModel/Utf8JsonStreamReader.cs
@@ -0,0 +1,273 @@
// Copyright (c) .NET Foundation. All rights reserved.
// Licensed under the Apache License, Version 2.0. See License.txt in the project root for license information.

using System;
using System.Buffers;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

namespace NuGet.ProjectModel
{
/// <summary>
/// This struct is used to read over a stream in parts, in order to avoid reading the entire stream into memory.
/// It functions as a wrapper around <see cref="Utf8JsonReader"/>, while maintaining a stream and a buffer to read from.
/// </summary>
internal ref struct Utf8JsonStreamReader
Contributor

@kartheekp-ms kartheekp-ms Dec 21, 2023

The Utf8JsonStreamReader struct appears to have two responsibilities, which violates the single responsibility principle.

Stream Wrapper: The struct acts as a wrapper around a stream, specifically for the purpose of reading it in chunks rather than loading the entire stream into memory.

Utf8JsonReader Wrapper: The struct also wraps the functionality of Utf8JsonReader. This includes functionalities like reading various data types (strings, integers, booleans) from JSON, handling different JSON token types, and managing the state of the JSON reader.

How about splitting it into two structs: one that handles the stream buffer (resizing it, etc.) and another that wraps Utf8JsonReader methods such as GetString(), GetBoolean(), etc.?

Contributor Author

This would be a large change. If we want to do this, I would like to do it in a new PR after the last one is complete.
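For readers weighing the proposed split, here is a minimal sketch of what the buffer-handling half might look like on its own. `PooledChunkBuffer`, `Bytes`, and `Compact` are hypothetical names that are not part of this PR; the logic mirrors the leftover-copy and doubling steps in `GetMoreBytesFromStream`, under the assumption that the JSON reader would live in a separate type.

```csharp
using System;
using System.Buffers;

// Hypothetical type (not in the PR): owns only the pooled buffer, none of the JSON logic.
public sealed class PooledChunkBuffer : IDisposable
{
    private readonly ArrayPool<byte> _pool;
    private byte[] _buffer;

    public PooledChunkBuffer(int size, ArrayPool<byte> pool = null)
    {
        _pool = pool ?? ArrayPool<byte>.Shared;
        _buffer = _pool.Rent(size); // may be larger than the requested size
    }

    public byte[] Bytes => _buffer;

    // Moves the unconsumed tail [consumed, used) to the front of the buffer.
    // When nothing was consumed and the buffer is full, rents a buffer twice as large first.
    // Returns the number of leftover bytes now at the start of the buffer.
    public int Compact(int consumed, int used)
    {
        int leftover = used - consumed;
        if (leftover == _buffer.Length)
        {
            byte[] bigger = _pool.Rent(_buffer.Length * 2);
            _buffer.AsSpan(consumed, leftover).CopyTo(bigger);
            _pool.Return(_buffer, clearArray: true);
            _buffer = bigger;
        }
        else
        {
            // Span.CopyTo handles overlapping source and destination.
            _buffer.AsSpan(consumed, leftover).CopyTo(_buffer);
        }
        return leftover;
    }

    public void Dispose()
    {
        if (_buffer != null)
        {
            _pool.Return(_buffer, clearArray: true);
            _buffer = null;
        }
    }
}
```

A JSON-facing wrapper could then call `Compact` with the reader's `BytesConsumed` before refilling from the stream. Whether that separation is worth the extra type is the open question in this thread.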

{
private static readonly char[] DelimitedStringDelimiters = [' ', ','];
private static readonly byte[] Utf8Bom = [0xEF, 0xBB, 0xBF];

private const int BufferSizeDefault = 16 * 1024;
private const int MinBufferSize = 1024;
private Utf8JsonReader _reader;
#pragma warning disable CA2213 // Disposable fields should be disposed
private Stream _stream;
#pragma warning restore CA2213 // Disposable fields should be disposed
// The buffer is used to read from the stream in chunks.
private byte[] _buffer;
private bool _disposed;
private ArrayPool<byte> _bufferPool;
private int _bufferUsed = 0;

internal Utf8JsonStreamReader(Stream stream, int bufferSize = BufferSizeDefault, ArrayPool<byte> arrayPool = null)
Contributor

The constructor throws the following exception when it is given a memory stream created from Array.Empty<byte>(). I am just letting you know in case you would like to add a test.

 using (var stream = new MemoryStream(Array.Empty<byte>()))
 using (var reader = new Utf8JsonStreamReader(stream))
{
}

System.Text.Json.JsonReaderException : '0x00' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 0.

Contributor Author

I added a test to demonstrate this behavior, but I don't think it's worthwhile adding validation for this.
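One possible reading of the exception above: the constructor calls `_stream.Read(_buffer, 0, 3)` without checking how many bytes came back, so for an empty stream the three buffer bytes are not a BOM and `_bufferUsed` is still set to 3. The sketch below shows a BOM check that counts the bytes actually read. `BomSketch` and `ReadPastBom` are hypothetical names, and this is an illustration under that assumption, not the validation the PR chose to add.

```csharp
using System;
using System.IO;

public static class BomSketch
{
    private static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Reads up to three bytes into the start of buffer and returns how many of them
    // are JSON payload: 0 when they were a UTF-8 BOM or the stream was empty.
    public static int ReadPastBom(Stream stream, byte[] buffer)
    {
        int read = 0;
        while (read < Utf8Bom.Length)
        {
            int n = stream.Read(buffer, read, Utf8Bom.Length - read);
            if (n == 0)
            {
                break; // end of stream before three bytes were available
            }
            read += n;
        }

        bool isBom = read == Utf8Bom.Length
            && Utf8Bom.AsSpan().SequenceEqual(buffer.AsSpan(0, read));
        return isBom ? 0 : read;
    }
}
```

With a helper like this, an empty stream would leave zero bytes in the buffer, so the reader would not be fed `0x00` bytes. It would still have no JSON to parse.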

{
if (stream is null)
{
throw new ArgumentNullException(nameof(stream));
}

if (bufferSize < MinBufferSize)
{
throw new ArgumentException($"Buffer size must be at least {MinBufferSize} bytes", nameof(bufferSize));
}

_bufferPool = arrayPool ?? ArrayPool<byte>.Shared;
_buffer = _bufferPool.Rent(bufferSize);
_disposed = false;
_stream = stream;
_stream.Read(_buffer, 0, 3);
if (!Utf8Bom.AsSpan().SequenceEqual(_buffer.AsSpan(0, 3)))
{
_bufferUsed = 3;
}

var initialJsonReaderState = new JsonReaderState(new JsonReaderOptions
{
AllowTrailingCommas = true,
CommentHandling = JsonCommentHandling.Skip,
});

ReadStreamIntoBuffer(initialJsonReaderState);
_reader.Read();
}

internal bool IsFinalBlock => _reader.IsFinalBlock;

internal JsonTokenType TokenType => _reader.TokenType;

internal bool ValueTextEquals(ReadOnlySpan<byte> utf8Text) => _reader.ValueTextEquals(utf8Text);

internal bool TryGetInt32(out int value) => _reader.TryGetInt32(out value);

internal string GetString() => _reader.GetString();

internal bool GetBoolean() => _reader.GetBoolean();

internal int GetInt32() => _reader.GetInt32();

internal bool Read()
{
ThrowExceptionIfDisposed();

bool wasRead;
while (!(wasRead = _reader.Read()) && !_reader.IsFinalBlock)
{
GetMoreBytesFromStream();
}
Comment on lines +85 to +88
Contributor

In some cases, which I currently don't understand, the STJ implementation raises an exception when IsFinalBlock is set to true. Therefore, it would be better to check this value first before invoking the underlying Read method. Another advantage is that reading IsFinalBlock is an O(1) operation.

Suggested change
- while (!(wasRead = _reader.Read()) && !_reader.IsFinalBlock)
- {
- GetMoreBytesFromStream();
- }
+ while (!_reader.IsFinalBlock && !(wasRead = _reader.Read()))
+ {
+ GetMoreBytesFromStream();
+ }

https://github.com/dotnet/runtime/blob/3a5bea5d60ea04b897ac968a358ca99a1189d368/src/libraries/System.Text.Json/src/System/Text/Json/Reader/Utf8JsonReader.cs#L269-L289

Contributor Author

If we check for the final block first, then we won't call Read at all. The exception you're seeing there means that the Utf8JsonReader was told it has all the data, but the current token type is None, meaning there is no JSON data in the reader. If we want to check for that scenario we can, though I think the check belongs in the constructor, not here.
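The ordering point can be seen in a small standalone sketch. `FinalBlockSketch` and `CountTokens` are hypothetical names; the method builds a `Utf8JsonReader` over a complete payload with `isFinalBlock: true` and counts how many tokens each condition order yields.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class FinalBlockSketch
{
    // Counts tokens from a reader that was told it holds the final block.
    // checkFinalBlockFirst == true mimics the suggested condition order.
    public static int CountTokens(string json, bool checkFinalBlockFirst)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(json);
        var reader = new Utf8JsonReader(utf8, isFinalBlock: true, state: default);
        int count = 0;
        while (true)
        {
            bool wasRead = false;
            if (checkFinalBlockFirst)
            {
                // IsFinalBlock is already true, so Read() is never evaluated.
                if (!reader.IsFinalBlock)
                {
                    wasRead = reader.Read();
                }
            }
            else
            {
                // Original order: always attempt the read first.
                wasRead = reader.Read();
            }

            if (!wasRead)
            {
                return count;
            }
            count++;
        }
    }
}
```

For the input `{"a":1}`, the original order yields the four tokens (start object, property name, number, end object), while the suggested order yields none because the read is short-circuited as soon as the last chunk has been loaded.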

return wasRead;
}

internal void Skip()
{
ThrowExceptionIfDisposed();

bool wasSkipped;
while (!(wasSkipped = _reader.TrySkip()) && !_reader.IsFinalBlock)
{
GetMoreBytesFromStream();
}
if (!wasSkipped)
{
_reader.Skip();
}
}

internal string ReadNextTokenAsString()
{
ThrowExceptionIfDisposed();

if (Read())
{
return _reader.ReadTokenAsString();
}

return null;
}

internal IList<string> ReadStringArrayAsIList(IList<string> strings = null)
{
if (TokenType == JsonTokenType.StartArray)
{
while (Read() && TokenType != JsonTokenType.EndArray)
{
string value = _reader.ReadTokenAsString();

strings = strings ?? new List<string>();

strings.Add(value);
}
}
return strings;
}

internal IReadOnlyList<string> ReadDelimitedString()
{
ThrowExceptionIfDisposed();

if (Read())
{
switch (TokenType)
{
case JsonTokenType.String:
var value = GetString();

return value.Split(DelimitedStringDelimiters, StringSplitOptions.RemoveEmptyEntries);

default:
var invalidCastException = new InvalidCastException();
throw new JsonException(invalidCastException.Message, invalidCastException);
}
}

return null;
}

internal bool ReadNextTokenAsBoolOrFalse()
{
ThrowExceptionIfDisposed();

if (Read() && (TokenType == JsonTokenType.False || TokenType == JsonTokenType.True))
{
return GetBoolean();
}
return false;
}

internal IReadOnlyList<string> ReadNextStringOrArrayOfStringsAsReadOnlyList()
{
ThrowExceptionIfDisposed();

if (Read())
{
switch (_reader.TokenType)
{
case JsonTokenType.String:
return new[] { (string)_reader.GetString() };

case JsonTokenType.StartArray:
return ReadStringArrayAsReadOnlyListFromArrayStart();

case JsonTokenType.StartObject:
return null;
}
}

return null;
}

internal IReadOnlyList<string> ReadStringArrayAsReadOnlyListFromArrayStart()
{
ThrowExceptionIfDisposed();

List<string> strings = null;

while (Read() && _reader.TokenType != JsonTokenType.EndArray)
{
string value = _reader.ReadTokenAsString();

strings = strings ?? new List<string>();

strings.Add(value);
}

return (IReadOnlyList<string>)strings ?? Array.Empty<string>();
}

// This function is called when Read() returns false and we're not already in the final block
private void GetMoreBytesFromStream()
{
if (_reader.BytesConsumed < _bufferUsed)
{
// If the number of bytes consumed by the reader is less than the amount set in the buffer then we have leftover bytes
var oldBuffer = _buffer;
ReadOnlySpan<byte> leftover = oldBuffer.AsSpan((int)_reader.BytesConsumed);
_bufferUsed = leftover.Length;

// If the leftover bytes are the same as the buffer size then we are at capacity and need to double the buffer size
if (leftover.Length == _buffer.Length)
{
_buffer = _bufferPool.Rent(_buffer.Length * 2);
leftover.CopyTo(_buffer);
_bufferPool.Return(oldBuffer, true);
}
else
{
leftover.CopyTo(_buffer);
}
}
else
{
_bufferUsed = 0;
}

ReadStreamIntoBuffer(_reader.CurrentState);
}

/// <summary>
/// Loops through the stream and reads it into the buffer until the buffer is full or the stream is empty, then creates the Utf8JsonReader.
/// </summary>
private void ReadStreamIntoBuffer(JsonReaderState jsonReaderState)
{
int bytesRead;
do
{
var spaceLeftInBuffer = _buffer.Length - _bufferUsed;
bytesRead = _stream.Read(_buffer, _bufferUsed, spaceLeftInBuffer);
Contributor

How about using the Stream.ReadAsync Method instead of the blocking Read method? It would be great if we could use the overload that accepts a CancellationToken.

The docs point to the .NET samples JSON project for an asynchronous version of this example.

Member

I don't have links handy, but my understanding is that, on .NET Framework at least, reading a file with async APIs has a non-trivial perf impact when the file was opened without passing the async flag to the operating system. Similarly, using blocking APIs on a file opened (from the OS point of view) with the async flag has a non-trivial perf impact. Also, .NET Framework's async scheduling just isn't nearly as good as .NET 6+'s, so even when async file IO is done consistently, there's still going to be a measurable perf impact when the delays are low (OS has file cached in memory?).

If this suggestion, using Stream.ReadAsync, is being seriously considered, I'd recommend reaching out to the VS Perf team and checking whether my belief about async file IO on .NET Framework is wrong.

Contributor

@davkean davkean Dec 21, 2023

Async isn't about making things faster; it's about yielding thread pool threads while the operation is blocking, which lets us use the CPU efficiently by keeping only CPU work on those threads.

NuGet spends a lot of time blocking thread pool threads during restore (see Thread Pool watson, it's #13, #14, #19, #22 in 17.8), so async where possible should be the approach.

Member

Thanks @davkean, good to know. It's a shame that most of the thread pool watson buckets are caused by .NET & Windows APIs where there are no async alternatives. But we should make time to investigate async file reads (where APIs exist).
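To make the async option concrete, here is a minimal sketch of how the buffer-filling step alone could use `Stream.ReadAsync` with a `CancellationToken`. `AsyncFillSketch` and `FillBufferAsync` are hypothetical names. Because `Utf8JsonReader` is a ref struct and cannot be held across an `await`, the sketch assumes parsing would stay synchronous and run over the buffer after each fill completes. It does not address the .NET Framework perf concerns raised above.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncFillSketch
{
    // Fills buffer starting at offset until it is full or the stream ends.
    // Returns the total number of valid bytes in the buffer (offset + bytes read).
    public static async Task<int> FillBufferAsync(
        Stream stream, byte[] buffer, int offset, CancellationToken cancellationToken)
    {
        int used = offset;
        while (used < buffer.Length)
        {
            int bytesRead = await stream
                .ReadAsync(buffer, used, buffer.Length - used, cancellationToken)
                .ConfigureAwait(false);
            if (bytesRead == 0)
            {
                break; // end of stream
            }
            used += bytesRead;
        }
        return used;
    }
}
```

A caller would await `FillBufferAsync`, then construct a `Utf8JsonReader` over `buffer.AsSpan(0, used)` in a synchronous helper, repeating until the stream is exhausted.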

_bufferUsed += bytesRead;
}
while (bytesRead != 0 && _bufferUsed != _buffer.Length);
Contributor

I'm curious about the need for the do-while loop inside the ReadStreamIntoBuffer method, especially considering that ReadStreamIntoBuffer is already invoked within a while loop in the Read method. If the buffer is smaller than the stream's content, or if the buffer has more capacity than what's currently available in the stream, the first call to _stream.Read in ReadStreamIntoBuffer should be enough to fill the buffer up to its capacity or the stream's current availability. Is the do-while loop necessary in this case, or could it lead to redundant read operations?

Contributor Author

I put it in a loop because of Andy's comment here: #5530 (comment).

There isn't any guarantee that the stream is actually read to the limit requested. So the loop is necessary to ensure we are reading until the buffer is full.
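The point about partial reads can be reproduced with a stream that deliberately returns less than it is asked for, which the `Stream.Read` contract permits. `OneByteStream` and `FillLoopSketch` are hypothetical names used only for this illustration; `ReadUntilFull` mirrors the do-while loop in `ReadStreamIntoBuffer`.

```csharp
using System;
using System.IO;

// Hypothetical stream that hands back at most one byte per Read call.
public sealed class OneByteStream : Stream
{
    private readonly byte[] _data;
    private int _position;

    public OneByteStream(byte[] data) { _data = data; }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => _position;
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (count == 0 || _position == _data.Length)
        {
            return 0;
        }
        buffer[offset] = _data[_position++];
        return 1;
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

public static class FillLoopSketch
{
    // A single Read call: may return fewer bytes than the buffer can hold.
    public static int ReadOnce(Stream stream, byte[] buffer) =>
        stream.Read(buffer, 0, buffer.Length);

    // Keeps reading until the buffer is full or the stream reports end of data.
    public static int ReadUntilFull(Stream stream, byte[] buffer)
    {
        int used = 0;
        int bytesRead;
        do
        {
            bytesRead = stream.Read(buffer, used, buffer.Length - used);
            used += bytesRead;
        }
        while (bytesRead != 0 && used != buffer.Length);
        return used;
    }
}
```

With a five-byte `OneByteStream` and a four-byte buffer, `ReadOnce` fills one byte while `ReadUntilFull` fills all four. Without the loop, the reader would be constructed over a mostly empty buffer and would ask for more data far more often.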

_reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, jsonReaderState);
Contributor

The doc at https://learn.microsoft.com/dotnet/standard/serialization/system-text-json/use-utf8jsonreader suggests using _reader.CurrentState. If this suggestion is valid for our scenario, then we can remove the jsonReaderState parameter from the method definition.

Suggested change
- _reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, jsonReaderState);
+ _reader = new Utf8JsonReader(_buffer.AsSpan(0, _bufferUsed), isFinalBlock: bytesRead == 0, _reader.CurrentState);

Contributor Author

The function takes in the jsonReaderState because, in the constructor, the reader state doesn't exist yet.
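As a standalone illustration of why the state has to be threaded through, the sketch below parses one JSON document delivered in two chunks. `StateCarrySketch` and `CountTokensAcrossChunks` are hypothetical names. The first reader starts from a default `JsonReaderState`, as the constructor must, and the second reader is created from the first reader's `CurrentState`, as `GetMoreBytesFromStream` does.

```csharp
using System;
using System.Text.Json;

public static class StateCarrySketch
{
    // Counts the tokens in a document that arrives as two UTF-8 chunks.
    public static int CountTokensAcrossChunks(byte[] first, byte[] second)
    {
        // First chunk: no previous reader exists, so start from a default state.
        var reader = new Utf8JsonReader(first, isFinalBlock: false, state: default);
        int count = 0;
        while (reader.Read())
        {
            count++;
        }

        // Bytes the reader could not consume yet (for example a string cut in half).
        ReadOnlySpan<byte> leftover = first.AsSpan((int)reader.BytesConsumed);
        byte[] next = new byte[leftover.Length + second.Length];
        leftover.CopyTo(next);
        second.CopyTo(next, leftover.Length);

        // Second chunk: resume where the first reader stopped.
        reader = new Utf8JsonReader(next, isFinalBlock: true, reader.CurrentState);
        while (reader.Read())
        {
            count++;
        }
        return count;
    }
}
```

Splitting `{"name":"value","n":1}` in the middle of the string value works here because the second reader inherits the knowledge that it is inside an object and expects a value next. A reader created from a fresh state would not know that.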

}

public void Dispose()
{
if (!_disposed)
{
_disposed = true;
byte[] toReturn = _buffer;
_buffer = null!;
_bufferPool.Return(toReturn, true);
}
}

private void ThrowExceptionIfDisposed()
{
if (_disposed)
{
throw new ObjectDisposedException(nameof(Utf8JsonStreamReader));
}
}
}
}
@@ -0,0 +1,13 @@
// Copyright (c) .NET Foundation. All rights reserved.
// Licensed under the Apache License, Version 2.0. See License.txt in the project root for license information.
namespace NuGet.ProjectModel
{
/// <summary>
/// An abstract class that defines a function for reading a <see cref="Utf8JsonStreamReader"/> into a <typeparamref name="T"/>
/// </summary>
/// <typeparam name="T"></typeparam>
internal abstract class Utf8JsonStreamReaderConverter<T>
{
public abstract T Read(ref Utf8JsonStreamReader reader);
}
}
@@ -213,7 +213,6 @@ public void LockFileFormat_ReadsLockFileWithNoTools()

var target = lockFile.Targets.Single();
Assert.Equal(NuGetFramework.Parse("dotnet"), target.TargetFramework);

var runtimeTargetLibrary = target.Libraries.Single();
Assert.Equal("System.Runtime", runtimeTargetLibrary.Name);
Assert.Equal(NuGetVersion.Parse("4.0.20-beta-22927"), runtimeTargetLibrary.Version);
@@ -0,0 +1,35 @@
// Copyright (c) .NET Foundation. All rights reserved.
// Licensed under the Apache License, Version 2.0. See License.txt in the project root for license information.

using System.Text;
using System.Text.Json;
using Xunit;

namespace NuGet.ProjectModel.Test
{
[UseCulture("")] // Fix tests failing on systems with non-English locales
public class Utf8JsonReaderExtensionsTests
{
[Theory]
[InlineData("null", null)]
[InlineData("true", "True")]
[InlineData("false", "False")]
[InlineData("-2", "-2")]
[InlineData("9223372036854775807", "9223372036854775807")]
[InlineData("3.14", "3.14")]
[InlineData("\"b\"", "b")]
public void ReadTokenAsString_WhenValueIsConvertibleToString_ReturnsValueAsString(
string value,
string expectedResult)
{
var json = $"{{\"a\":{value}}}";
var encodedBytes = Encoding.UTF8.GetBytes(json);
var reader = new Utf8JsonReader(encodedBytes);
reader.Read();
reader.Read();
reader.Read();
string actualResult = reader.ReadTokenAsString();
Assert.Equal(expectedResult, actualResult);
}
}
}
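The `[UseCulture("")]` attribute on the test class is needed because `ReadNumberAsString` formats numbers with the current culture. The sketch below contrasts that with invariant-culture formatting. `NumberFormattingSketch` and both method names are hypothetical, and the invariant variant is an assumption about one possible alternative, not something this PR adopts.

```csharp
using System;
using System.Globalization;

public static class NumberFormattingSketch
{
    // Formats the way ReadNumberAsString does: with the current culture.
    public static string FormatWithCurrentCulture(double value) => value.ToString();

    // Hypothetical alternative: always format with the invariant culture.
    public static string FormatInvariant(double value) =>
        value.ToString(CultureInfo.InvariantCulture);

    public static void Main()
    {
        CultureInfo previous = CultureInfo.CurrentCulture;
        try
        {
            CultureInfo.CurrentCulture = new CultureInfo("de-DE");
            // The decimal separator here follows the de-DE culture data.
            Console.WriteLine(FormatWithCurrentCulture(3.14));
            // The invariant culture always uses '.' as the decimal separator.
            Console.WriteLine(FormatInvariant(3.14));
        }
        finally
        {
            CultureInfo.CurrentCulture = previous;
        }
    }
}
```

Whether the production code should format invariantly depends on how callers compare these strings, which this PR does not settle. Pinning the test culture keeps the expected values stable either way.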