Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for exotic external tar asset archives, fix some more corner case bugs #74412

Merged
merged 10 commits into from
Aug 23, 2022

Conversation

carlossanlop
Copy link
Member

@carlossanlop carlossanlop commented Aug 23, 2022

Addresses #74316 (excludes libarchive, which were not added as assets)

src changes:

  • Remove unused _readFirstEntry field in TarReader.
  • Add processDataBlock bool argument to TryGetNextHeader so that in the case of Pax, we don't read the data block until after we process the extended attributes, in case size is encountered among them and we override the original value.
  • Throw more clearly when encountering a sparse entry. We won't support it for 7.0.
  • Throw when encountering a size field > 0 in entries where data is not expected.
  • When copying data, make sure that the freshly created memory stream is rewinded so that the user's first call of DataStream returns a stream positioned in the first byte, instead of the last byte.
  • For octal fields, trim leading nulls and whitespaces. We already trim trailing nulls and whitespaces. @am11 Allow spaces in octal attributes #74358 my changes would conflict a lot with yours. I kept the change as simple and clear as possible, and only confined it to those two characters.

test changes:

… user access of the DataStream property gives them a stream ready to read from the beginning.
…butes are analyzed, in case the size is found as an extended attribute and needs to be overriden.
… read. Verify their DataStream if available.
…s encountered. These entries have additional data that indicates the locations of sparse bytes, which cannot be read with just the size field. So to avoid accidentally offseting the reader, we throw.
@ghost
Copy link

ghost commented Aug 23, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #74316

src changes:

  • Remove unused _readFirstEntry field in TarReader.
  • Add processDataBlock bool argument to TryGetNextHeader so that in the case of Pax, we don't read the data block until after we process the extended attributes, in case size is encountered among them and we override the original value.
  • Throw more clearly when encountering a sparse entry. We won't support it for 7.0.
  • Throw when encountering a size field > 0 in entries where data is not expected.
  • When copying data, make sure that the freshly created memory stream is rewinded so that the user's first call of DataStream returns a stream positioned in the first byte, instead of the last byte.
  • For octal fields, trim trailing nulls and whitespaces. We already trim trailing nulls and whitespaces. @am11 Allow spaces in octal attributes #74358 my changes would conflict a lot with yours. I kept the change as simple and clear as possible, and only confined it to those two characters.

test changes:

Author: carlossanlop
Assignees: carlossanlop
Labels:

area-System.IO

Milestone: -

@danmoseley
Copy link
Member

For octal fields, trim trailing nulls and whitespaces. We already trim trailing nulls and whitespaces

Leading?

@danmoseley
Copy link
Member

Can you also check how we fare against the libarchive ones now?

@am11
Copy link
Member

am11 commented Aug 23, 2022

my changes would conflict a lot with yours.

What was the reason of handling it here when the other PR is up?

@danmoseley
Copy link
Member

Throw when encountering a size field > 0 in entries where data is not expected.

Is this the only case where this PR make us stricter than before? We're confident this isn't a real world case?

@carlossanlop
Copy link
Member Author

What was the reason of handling it here when the other PR is up?

@am11 I wanted to cover tests for all the individual archives that you were kind enough to bring to our repo. I had to write a lot of helper code for the tests that would make it complicated to merge with your change. Since we want to include this change in the rc1 backport, I thought the fastest way to address the octal bug was to include it here, and request for your help, which is greatly appreciated. I'll make sure to give credit to you in the commit message.

Apologies for the last minute notice. Would you be so kind to provide feedback on the change? In my opinion, we should keep it very limited to only nulls and whitespaces.

@carlossanlop
Copy link
Member Author

carlossanlop commented Aug 23, 2022

Is this the only case where this PR make us stricter than before? We're confident this isn't a real world case?

Good question. I only encountered this problem in one of the archives, in a fifo entry, which is an entry type in which the data section is unexpected.

The code was already not expecting data for those kinds of entries, but it would keep running until it tried to retrieve the next entry, and would encounter malformed data, because the position pointer would be offset (the pointer was not advancing beyond the unexpected data and its padding).

So I'd rather throw early than unexpectedly throw FormatException later, which would confuse the user.

This is the only case that would make us stricter, among these changes.

@carlossanlop
Copy link
Member Author

Can you also check how we fare against the libarchive ones now?

Yes, I can do that next. We don't yet have them in runtime-assets, so I am not sure I'll have time to include it in this PR or in RC1.

@@ -203,6 +203,12 @@ internal static TarEntryType GetCorrectTypeFlagForFormat(TarEntryFormat format,
internal static T ParseOctal<T>(ReadOnlySpan<byte> buffer) where T : struct, INumber<T>
{
buffer = TrimEndingNullsAndSpaces(buffer);
buffer = TrimLeadingNullsAndSpaces(buffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: either {Ending,Starting} or {Trailing,Leading} pair

@@ -205,6 +205,9 @@
<value>The entry is a symbolic link or a hard link but the LinkName field is null or empty.</value>
</data>
<data name="TarEntryTypeNotSupported" xml:space="preserve">
<value>Entry type '{0}' not supported.</value>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<value>Entry type '{0}' not supported.</value>
<value>Entry type '{0}' is not supported.</value>

And below

@@ -203,6 +203,12 @@ internal static TarEntryType GetCorrectTypeFlagForFormat(TarEntryFormat format,
internal static T ParseOctal<T>(ReadOnlySpan<byte> buffer) where T : struct, INumber<T>
{
buffer = TrimEndingNullsAndSpaces(buffer);
buffer = TrimLeadingNullsAndSpaces(buffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, most tools I have tried only ignore nulls and spaces. I haven't found a spec explicitly calling it out how lenient the implementation can or should be, but libarchive's behavior is that it ignores all non-octal characters.

Maybe add a comment linking to formal specification which talks about it, or if there is none:
"The formal specification does not specify it but we only trim nulls and spaces"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @am11. Sure, I can add a comment. If I don't get any blocking feedback to address, I'll add the comment in a separate PR to not block this with a CI re-run.

Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @carlossanlop !

I left some comments, but they were all nits that don't need to be fixed right now if you want to merge it asap and backport to RC1.

@@ -311,6 +319,8 @@ private async Task ProcessDataBlockAsync(Stream archiveStream, bool copyData, Ca
{
MemoryStream copiedData = new MemoryStream();
TarHelpers.CopyBytes(archiveStream, copiedData, _size);
// Reset position pointer so the user can do the first DataStream read from the beginning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: slightly better wording (don't trigger CI for this suggestion)

Suggested change
// Reset position pointer so the user can do the first DataStream read from the beginning
// Reset stream position so the user can do the first DataStream read from the beginning

also: do other usages of TarHelpers.CopyBytes also need to reset the position? If so, we could move it to this method itself.

Comment on lines +253 to +261
{
int newStart = 0;
while (newStart < buffer.Length && buffer[newStart] is 0 or 32)
{
newStart++;
}

return buffer.Slice(newStart);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can use the new IndexOfAnyExcept here

Suggested change
{
int newStart = 0;
while (newStart < buffer.Length && buffer[newStart] is 0 or 32)
{
newStart++;
}
return buffer.Slice(newStart);
}
=> buffer.Slice(Math.Max(0, buffer.IndexOfAnyExcept((byte)0, (byte)' '));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's a nit, I can address this in a separate PR.

Assert.NotNull(await reader.GetNextEntryAsync());
Assert.NotNull(await reader.GetNextEntryAsync());
Assert.NotNull(await reader.GetNextEntryAsync());
Assert.NotNull(await reader.GetNextEntryAsync());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: personally, I would use for loop here and add a comment explaining why we expect 8 entries

Suggested change
Assert.NotNull(await reader.GetNextEntryAsync());
for (int i = 0; i < 8; i++)
{
Assert.NotNull(await reader.GetNextEntryAsync());
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That approach works great when the test passes. It's way less great when it fails, since you don't know what i was.

Comment on lines +568 to +573
{
foreach (string name in NodeTarTestCaseNames)
{
yield return new object[] { name };
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could use LINQ here

Suggested change
{
foreach (string name in NodeTarTestCaseNames)
{
yield return new object[] { name };
}
}
=> NodeTarTestCaseNames.Select(name => new object[] { name });

@@ -205,6 +205,9 @@
<value>The entry is a symbolic link or a hard link but the LinkName field is null or empty.</value>
</data>
<data name="TarEntryTypeNotSupported" xml:space="preserve">
<value>Entry type '{0}' not supported.</value>
</data>
<data name="TarEntryTypeNotSupportedInFormat" xml:space="preserve">
<value>Entry type '{0}' not supported in format '{1}'.</value>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<value>Entry type '{0}' not supported in format '{1}'.</value>
<value>Entry type '{0}' is not supported in format '{1}'.</value>

{
// Currently sparse entries are not supported.

// There are PAX archives archives in the golang folder that have extended attributes for treating a regular file as a sparse file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"archives archives"?

{
// The four supported formats have a header that fits in the default record size
Span<byte> buffer = stackalloc byte[TarHelpers.RecordSize];

archiveStream.ReadExactly(buffer);

TarHeader? header = TryReadAttributes(initialFormat, buffer);
if (header != null)
if (header != null && processDataBlock)
Copy link
Member

@jozkee jozkee Aug 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this better than one extra param?

Suggested change
if (header != null && processDataBlock)
if (header != null && initialFormat != TarEntryFormat.Pax)

Copy link
Member Author

@carlossanlop carlossanlop Aug 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. That was my initial approach, but then I opted for the extra boolean for these reasons:

  • We have multiple types of PAX entries: Global Extended Attriburtes, Extended Attributes, and the normal entries. I was introducing regressions in some of those cases.
  • It's much more clear what the boolean is for, rather than a not-equal check.

{
await using MemoryStream archiveStream = GetTarMemoryStream(CompressionMethod.Uncompressed, "golang_tar", testCaseName);
await using TarReader reader = new TarReader(archiveStream);
await Assert.ThrowsAsync<FormatException>(async () => await reader.GetNextEntryAsync());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
await Assert.ThrowsAsync<FormatException>(async () => await reader.GetNextEntryAsync());
await Assert.ThrowsAsync<FormatException>(() => reader.GetNextEntryAsync().AsTask());


await using MemoryStream archiveStream = GetTarMemoryStream(CompressionMethod.Uncompressed, testFolderName, testCaseName);
await using TarReader reader = new TarReader(archiveStream);
await Assert.ThrowsAsync<NotSupportedException>(async () => await reader.GetNextEntryAsync());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to workaround an archive with sparse files? say, we still want to traverse the rest of entries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no. Note that sparse files are a rare scenario, and from the beginning it was decided that archives containing these would not be supported for 7.0.

@carlossanlop carlossanlop merged commit c5a0e55 into dotnet:main Aug 23, 2022
@carlossanlop carlossanlop deleted the ExtAttrSizeDataBlock branch August 23, 2022 18:14
carlossanlop added a commit to carlossanlop/runtime that referenced this pull request Aug 23, 2022
…r case bugs (dotnet#74412)

* Remove unused _readFirstEntry. Remnant from before we created PaxGlobalExtendedAttributesEntry.

* Set the position of the freshly copied data stream to 0, so the first user access of the DataStream property gives them a stream ready to read from the beginning.

* Process a PAX actual entry's data block only after the extended attributes are analyzed, in case the size is found as an extended attribute and needs to be overriden.

* Add tests to verify the entries of the new external tar assets can be read. Verify their DataStream if available.

* Add copyData argument to recent alignment padding tests.

* Throw an exception sooner and with a clearer message when a data section is unexpected for the entry type.

* Allow trailing nulls and spaces in octal fields.
Co-authored-by: @am11 Adeel Mujahid <3840695+am11@users.noreply.github.com>

* Throw a clearer exception if the unsupported sparse file entry type is encountered. These entries have additional data that indicates the locations of sparse bytes, which cannot be read with just the size field. So to avoid accidentally offseting the reader, we throw.

* Tests.

* Rename to TrimLeadingNullsAndSpaces

Co-authored-by: carlossanlop <carlossanlop@users.noreply.github.com>
Comment on lines +254 to +255
AssertExtensions.GreaterThan(entry.Checksum, 0);
AssertExtensions.GreaterThan((int)entry.Mode, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test where we expext zero? i.e: when the bytes passed into ParseOctal are all leading spaces.

carlossanlop added a commit to carlossanlop/runtime that referenced this pull request Aug 23, 2022
…r case bugs (dotnet#74412)

* Remove unused _readFirstEntry. Remnant from before we created PaxGlobalExtendedAttributesEntry.

* Set the position of the freshly copied data stream to 0, so the first user access of the DataStream property gives them a stream ready to read from the beginning.

* Process a PAX actual entry's data block only after the extended attributes are analyzed, in case the size is found as an extended attribute and needs to be overriden.

* Add tests to verify the entries of the new external tar assets can be read. Verify their DataStream if available.

* Add copyData argument to recent alignment padding tests.

* Throw an exception sooner and with a clearer message when a data section is unexpected for the entry type.

* Allow trailing nulls and spaces in octal fields.
Co-authored-by: @am11 Adeel Mujahid <3840695+am11@users.noreply.github.com>

* Throw a clearer exception if the unsupported sparse file entry type is encountered. These entries have additional data that indicates the locations of sparse bytes, which cannot be read with just the size field. So to avoid accidentally offseting the reader, we throw.

* Tests.

* Rename to TrimLeadingNullsAndSpaces

Co-authored-by: carlossanlop <carlossanlop@users.noreply.github.com>
carlossanlop added a commit that referenced this pull request Aug 24, 2022
* Improve performance of Tar library (#74281)

* Avoid unnecessary byte[] allocations

* Remove unnecessary use of FileStreamOptions

* Clean up Dispose{Async} implementations

* Clean up unnecessary consts

Not a perf thing, just readability.

* Remove MemoryStream/Encoding.UTF8.GetBytes allocations, unnecessary async variants, and overhaul GenerateExtendedAttributesDataStream

* Avoid string allocations in ReadMagicAttribute

* Avoid allocation in WriteAsOctal

* Improve handling of octal

* Avoid allocation for version string

* Removing boxing and char string allocation in GenerateExtendedAttributeName

* Fix a couple unnecessary dictionary lookups

* Replace Enum.HasFlag usage

* Remove allocations from Write{Posix}Name

* Replace ArrayPool use with string.Create

* Replace more superfluous ArrayPool usage

* Remove ArrayPool use from System.IO.Compression.ZipFile

* Fix inverted condition

* Use generic math to parse octal

* Remove allocations from StringReader and string.Split

* Remove magic string allocation for Ustar when not V7

* Remove file name and directory name allocation in GenerateExtendedAttributeName

* fix tar strings (#74321)

* Fix some corner cases in TarReader (#74329)

* Fix a few Tar issues post perf improvements (#74338)

* Fix a few Tar issues post perf improvements

* Update src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

* Skip directory symlink recursion on TarFile archive creation (#74376)

* Skip directory symlink recursion on TarFile archive creation

* Add symlink verification

* Address suggestions by danmoseley

Co-authored-by: carlossanlop <carlossanlop@users.noreply.github.com>

* SkipBlockAlignmentPadding must be executed for all entries (#74396)

* Set modified timestamps on files being extracted from tar archives (#74400)

* Add tests for exotic external tar asset archives, fix some more corner case bugs (#74412)

* Remove unused _readFirstEntry. Remnant from before we created PaxGlobalExtendedAttributesEntry.

* Set the position of the freshly copied data stream to 0, so the first user access of the DataStream property gives them a stream ready to read from the beginning.

* Process a PAX actual entry's data block only after the extended attributes are analyzed, in case the size is found as an extended attribute and needs to be overriden.

* Add tests to verify the entries of the new external tar assets can be read. Verify their DataStream if available.

* Add copyData argument to recent alignment padding tests.

* Throw an exception sooner and with a clearer message when a data section is unexpected for the entry type.

* Allow trailing nulls and spaces in octal fields.
Co-authored-by: @am11 Adeel Mujahid <3840695+am11@users.noreply.github.com>

* Throw a clearer exception if the unsupported sparse file entry type is encountered. These entries have additional data that indicates the locations of sparse bytes, which cannot be read with just the size field. So to avoid accidentally offseting the reader, we throw.

* Tests.

* Rename to TrimLeadingNullsAndSpaces

Co-authored-by: carlossanlop <carlossanlop@users.noreply.github.com>

* Remove Compression changes, keep changes confined to Tar.

* Fix build failure due to missing using in TarHelpers.Windows

Co-authored-by: Stephen Toub <stoub@microsoft.com>
Co-authored-by: Dan Moseley <danmose@microsoft.com>
Co-authored-by: Adeel Mujahid <3840695+am11@users.noreply.github.com>
Co-authored-by: carlossanlop <carlossanlop@users.noreply.github.com>
Co-authored-by: David Cantú <dacantu@microsoft.com>
@ghost ghost locked as resolved and limited conversation to collaborators Sep 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants