-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tar: Fix PAX regression when handling the size of really long unseekable data streams #88280
Tar: Fix PAX regression when handling the size of really long unseekable data streams #88280
Conversation
Tagging subscribers to this area: @dotnet/area-system-formats-tar Issue DetailsRegression introduced by #84279 That fix refactored the code so that unseekable data streams could be properly handled. Unfortunately, I didn't execute Outerloop locally, and apparently they also didn't get executed in my PR. If that had happened, we would've seen a failure happening in the test named That test creates a >8GB tar file on disk (no other way since we cannot create In debug runs, the test fails in a Debug.Assert that checks both the existence of the size key in the extended attributes dictionary, and also that it gets converted to octal to the right value. In release runs, since that Debug.Assert is skipped, the test times out because the reader is unable to advance to the next entry due to the fact that the data stream size is detected as zero. The root cause is that in PAX, the size standard header field was written too late: We were first collecting all the extended attributes that needed to be obtained from standard fields, then we were writing the extended attributes entry, then we were reading the unseekable data stream to retrieve the size (too late), and then we were writing the actual entry and its data stream. The code refactoring reuses code for all the entry formats, but for PAX the order of actions should be different. The fix consists in separate the writing of entries much more than in the original refactoring: one path for seekable data streams, and another path for unseekable data streams. For V7 and UStar, choosing one path or the other is very similar and very simple. For GNU, choosing the path only varies in that the long link or long name entry needs to be written first, and then we choose a path like we do with V7 and UStar. But with PAX, we need to first collect the size of the data stream, then collect the extended attributes (which will include the too-long size value), then write the extended attributes entry, and finally write the actual entry with its data stream. I ran all tests locally, including the Outerloop ones, and they all pass. NOTE: The
|
/azp run runtime-libraries-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
The long running test failure is still happening, but that's expected, I can even repro locally. Do we have a unit test attribute to indicate the test is long running? I am not sure what other alternatives we have. MemoryStream does not allow handling 8GB. I am unsure if we are allowed to sture a file that large in runtime-assets. |
We can make a custom flavor MemoryStream presumably, it could even fake its content. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone knows how to do this in-memory, please let me know.
MemoryStream
is backed by array of bytes, which has the size limit.
The alternative is to use a stream backed by native memory: UnmanagedMemoryStream which is exposed by System.Runtime (so it should work fine here).
Or as @danmoseley has suggested, you could fake it by for example implementing a stream that repeats the same data N times when reading from it.
this already does something similar, perhaps it can be adapted/copied. if it's a general purpose test utility you could choose to put it in src\libraries\Common\tests\System\IO |
… really long data streams to get its size correctly stored in the extended attributes when the data stream is unseekable.
868a661
to
5482008
Compare
@ericstj @jeffhandley @ViktorHofer as discussed in our meeting, I moved the long file tests to a manual tests project. I opened #89037 to track adding a similar test that runs on CI and simulates writing/reading 8GB file entries. |
The change from #88280 caused an issue when discovering the test: ``` System.InvalidOperationException : An appropriate member 'ManualTestsEnabled' could not be found. The conditional method needs to be a static method, property, or field on the type System.Formats.Tar.Tests.ManualTestsAsync or any ancestor, of any visibility, accepting zero arguments, and having a return type of Boolean. ```
The change from #88280 caused an issue when discovering the test: ``` System.InvalidOperationException : An appropriate member 'ManualTestsEnabled' could not be found. The conditional method needs to be a static method, property, or field on the type System.Formats.Tar.Tests.ManualTestsAsync or any ancestor, of any visibility, accepting zero arguments, and having a return type of Boolean. ```
Fixes: #88271
Fixes: #87359
Regression introduced by #84279
That fix refactored the code so that unseekable data streams could be properly handled. Unfortunately, I didn't execute Outerloop locally, and apparently they also didn't get executed in my PR. If that had happened, we would've seen a failure happening in the test named
System.Formats.Tar.Tests.TarWriter_WriteEntry_LongFile_Tests.WriteEntry_LongFileSize
.That test creates a >8GB tar file on disk (no other way since we cannot create
MemoryStream
instances that large) with PAX entries, then tries to load it in a TarReader, expecting the size of the large entry to have the correct size.In debug runs, the test fails in a Debug.Assert that checks both the existence of the size key in the extended attributes dictionary, and also that it gets converted to octal to the right value.
In release runs, since that Debug.Assert is skipped, the test times out because the reader is unable to advance to the next entry due to the fact that the data stream size is detected as zero.
The root cause is that in PAX, the size standard header field was written too late: We were first collecting all the extended attributes that needed to be obtained from standard fields, then we were writing the extended attributes entry, then we were reading the unseekable data stream to retrieve the size (too late), and then we were writing the actual entry and its data stream.
The code refactoring reuses code for all the entry formats, but for PAX the order of actions should be different.
The fix consists in separate the writing of entries much more than in the original refactoring: one path for seekable data streams, and another path for unseekable data streams.
For V7 and UStar, choosing one path or the other is very similar and very simple.
For GNU, choosing the path only varies in that the long link or long name entry needs to be written first, and then we choose a path like we do with V7 and UStar.
But with PAX, we need to first collect the size of the data stream, then collect the extended attributes (which will include the too-long size value), then write the extended attributes entry, and finally write the actual entry with its data stream.
I ran all tests locally, including the Outerloop ones, and they all pass.
NOTE: The
WriteEntry_LongFileSizeAsync
is showing up asLong Running Test
locally, but it passes (2-4 min to finish). I am unsure how to speed it up, considering that we need to test an >8GB file on disk. If someone knows how to do this in-memory, please let me know. If the test fails in my PR, I'd like to log a separate issue and investigate it separately.