-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TarReader throws on various archives that other tools accept #74316
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Tagging subscribers to this area: @dotnet/area-system-io Issue DetailsI tried opening each of the tar files used to test Golang's tar package (here with details about each in the tests here) test code I used// See https://aka.ms/new-console-template for more information
using System.Formats.Tar;
using Xunit;
public static class C
{
public async static Task Main()
{
List<Task> tasks = new();
foreach (string path in Directory.EnumerateFiles(@"C:\git\go\src\archive\tar\testdata", "*.tar"))
{
tasks.Add(Task.Run(async () =>
{
TarEntry? entry = null;
try
{
//Console.WriteLine($"{path} opening...");
using FileStream fs = new(path, FileMode.Open);
using TarReader reader = new(fs, leaveOpen: false);
while ((entry = await reader.GetNextEntryAsync()) != null)
{
var ms = new MemoryStream();
Assert.NotEmpty(entry.Name);
Assert.True(Enum.IsDefined(entry.EntryType));
Assert.True(Enum.IsDefined(entry.Format));
if (entry.EntryType == TarEntryType.Directory)
continue;
var ds = entry.DataStream;
if (ds != null && ds.Length > 0)
{
ds.CopyTo(ms);
}
}
}
catch (Exception ex) //when (!(ex is FormatException))
{
Console.WriteLine($"{path} opening {entry?.Name} threw {ex.Message}");
}
}));
}
await Task.WhenAll(tasks);
}
}
I ignored all the ones that opened successfully and assumed that was expected. Here are the failures:
Possibly some of these are expected limitations, but for the others we should add checkboxes and work through and fix them.
|
Aside, I noticed this resource string does not have a format position, so it drops the path: if (!File.Exists(sourceFileName))
{
throw new FileNotFoundException(string.Format(SR.IO_FileNotFound, sourceFileName));
} another example
|
What do the empty cells in the table mean? |
Also, as @jaredpar mentioned in an old blogpost using File.Exists before opening the file is unreliable due to FS changes from other sources. |
@MichalPetryka it means I didn't try it.
Jared's point is something like: even if you check File.Exists, you still need to handle the possibility of the file not existing a moment later when you try to read it. In this case, if such a thing happened, an exception would be thrown to the caller which is fine. Yes, as matter of style, or maximum efficiency, we could catch FileNotFoundException instead, so long as the message was just as good. |
Added node and libarchive results. Thanks @am11 for pointing out all 3 of these. The other ones he found are GPL. We can probably still test that we can open those, we just can't copy them for our test assets, however rather than reason about that I'll leave them for now. |
Here is another corpus under permissive license. I don't have time to run the code above on those too, but we should do that after we fix the bugs above. |
One of them is failing like this:
|
This delta fixes all issues with node-tar fixtures: main...am11:runtime:feature/system.formats.tar/hardlinks-support (working on tests). |
Some in here (mostly compressed, would have to be decompressed, so perhaps not exercising tar much) https://github.com/adamhathcock/sharpcompress/tree/master/tests/TestArchives/Archives |
A few from OpenBSD, uuencoded and BSD license - https://github.com/openbsd/src/tree/master/regress/bin/pax |
For tar.gz and .tgz, we can use GZipStream in our runner. For .xz and .7z, we don't have OOTB support yet (#1542), so I think we can use a 3p library in tests. I will add a few files from node-tar which failed to parse with current |
Compression is a strict wrapper over tar, right? So if there are any useful tar's we want to test with (and license allows) that are tar.gz or whatever, we can simply uncompress them into .tar before putting them in the test assets? |
BTW, as mentioned above I largely ignored tar files that successfully opened, although these may suggest missing positive cases. That's why I put empty.tar in the list above, although I now see we do have a test for that ( |
Agreed. Additionally, the compression method used isn't too important. What really matters, if we want to emulate a compressed archive, is to make the archive stream unseekable, and we can achieve this by wrapping the main stream with a WrappedStream constructed with CanSeek=false, and pass that to the TarReader. |
I ran the same test over the files in https://github.com/alexcrichton/tar-rs/tree/master/tests/archives and got 3 failures: 7z_long_path.tar -- Entry '././@LongLink' was expected to be in the GNU format, but did not have the expected version data. empty_filename looks like it's already in the table, not sure about the other two. I'll add all of those (not just the failures) to runtime-assets as well. |
This is great, @am11, thank you so much for working on the fix! Ping me when a PR is out, we would want to get this backported to rc1. @danmoseley I noticed a few lines in that diff (SequenceEquals calls) would depend on the recent perf improvements PR that has pending approval for backporting to rc1. Mentioning this in case we can consider this fix's dependency as an additional argument in favor to backport the perf PR. Otherwise, this fix would have to be adjusted as a backport. |
FWIW, I have tested libarchive and GNU tar(1) and they both extract those five tar files successfully. GNU tar warns about the last one (but still extracts the contents successfully):
I have not tested those files with rust and node, but based on previous conversation, if majority of tools are OK with these misalignments, we should tolerate them as well. |
And 7zip: |
@MichalPetryka can you share the exception messages? |
|
Oh apologies. I missed that you said 7-zip. I thought those were errors on the .NET APIs. Thanks for sharing. |
@carlossanlop do you feel we must do more work here for 7.0? It seems to me we are in a reasonable place based on the data we have. These test tar's are synthetic after all. I suggest to move this to Future. We should certainly pick up the libarchive tar's if they let us know it's OK. |
No more work for 7.0 in my opinion, @danmoseley. Thanks for moving the milestone. |
How confident are we in the opposite direction, that tars produced by TarWriter are consumable by all commonly-used tools? Do we have tests for that direction, e.g. generate various outputs with TarWriter, shell out to |
FWIW, I've run into a real-world scenario of a tarball that can't be read with .NET runtime 7.0.0. I'm attempting to read the tarball for a Fedora container image layer. It fails with this callstack:
Here's my repro: using System.Formats.Tar;
var client = new HttpClient();
var message = await client.GetAsync(
"https://registry.fedoraproject.org/v2/fedora/blobs/sha256:7a05f01240abe225dc6c0178dd0fa67874478dadaba59f33efde33bcfb242d93");
var layerStream = await message.Content.ReadAsStreamAsync();
using var layerReader = new TarReader(layerStream);
while (true)
{
var layerEntry = layerReader.GetNextEntry();
if (layerEntry is null)
{
return;
}
} |
We had the following patch in #74358, which was matching BSD & GNU tar(1) as well as libarchive's behavior: --- a/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
+++ b/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
@@ -221,6 +221,11 @@ internal static TarEntryType GetCorrectTypeFlagForFormat(TarEntryFormat format,
buffer = TrimEndingNullsAndSpaces(buffer);
buffer = TrimLeadingNullsAndSpaces(buffer);
+ // skip leading non-octal bytes
+ int offset = 0;
+ for (; offset < buffer.Length && (buffer[offset] < (byte)'0' || buffer[offset] > (byte)'7'); ++offset);
+ buffer = buffer.Slice(offset);
+
if (buffer.Length == 0)
{
return T.Zero; it was rejected because it seemed too permissive (going by the standard; which suggests to only ignore 0 and 32 from ASCII table; Fedora image has ACK (ASCII 06) at index 0...) |
@carlossanlop @jozkee @jeffhandley should we reconsider our approach? |
Not without understanding why the Fedora image has an ACK there. Otherwise, the correct thing to patch would be whatever generated that tar, not .NET. |
@omajid may be of assistance here in tracking down what tool is used to produce the tarball. I saw the same issue with CentOS, which is in the same distro family as Fedora. @omajid - see my post here: #74316 (comment) |
From what I can tell, what Fedora does goes like this:
Here's a build log that shows all the steps except the last one: https://kojipkgs.fedoraproject.org//packages/Fedora-Container-Base/37/20221110.n.0/data/logs/image/oz-x86_64.log The underlying tool seems to be https://github.com/redhat-imaging/imagefactory. I can't see much more because the website is dead: http://imgfac.org/ I am not terribly familiar with this part of Fedora, so I might have some details wrong. |
This doc suggests that there is also a constraint on length of octal digits (six, seven, eleven depending on format, when it was implementation -- pre/post POSIX 1988 standard and the platform). We don't check for length at all, so we are lenient as-is. Following what libarchive does for the octal (find first octal digit in header, read until the first non-octal digit or end of buffer), it won't be setting new precedence. Our |
Hey @mthalman I just ran into the unable to parse number issue as well using .NET's http client Apparently the stream is being gziped twice I got a clue from here: icsharpcode/SharpZipLib#514 In my case I was doing a similar thing: task {
use! str = http.GetStreamAsync url
// System.IO.InvalidDataException: Unable To Parse Number
TarFile.ExtractToDirectory(source, output, false)
} Wrapping the stream from the http client made it work task {
use! str = http.GetStreamAsync url
// It works
use outer = new GZipStream(source, CompressionMode.Decompress)
TarFile.ExtractToDirectory(outer, output, false)
} It might be worth checking if it is a similar issue :) Edit: I'm not sure if this was a particular case of the .NET's HttpClient or a gzip gzip from node themselves I just know wrapping it made it work 😅 |
If that's common, I could imagine Tar giving a hint in the error, based on the Gzip magic number. |
Another real-life example of the "Unable to parse number" exception: https://github.com/ImageOptim/gifski/releases/download/1.11.0/gifski-1.11.0.tar.xz
Wrapping in a |
@NickeManarin the We don't yet support the LZMA algorithm in A very easy workaround is to import using SharpCompress.Compressors.Xz;
using System.Formats.Tar;
using System.IO;
class CSharpTestClass
{
static void Main()
{
string tarXzArchivePath = @"D:\Downloads\gifski-1.11.0.tar.xz";
string destinationDirectoryPath = @"D:\Downloads\extractedxz";
if (!Directory.Exists(destinationDirectoryPath))
{
Directory.CreateDirectory(destinationDirectoryPath);
}
using FileStream file = File.Open(tarXzArchivePath, FileMode.Open);
using XZStream xzStream = new(file);
TarFile.ExtractToDirectory(xzStream, destinationDirectoryPath, overwriteFiles: false);
}
} Hope that helps! |
@carlossanlop Thanks, that worked! |
Would we accept a change that, on failure to decompress, included the compression format in the message (by looking at magic numbers presumably). That might help in cases like this. It could just be best effort. |
I agree, we could add logic to TarReader to detect a compressed archive by reading the magic numbers. I opened issue #89056 to track that request specifically. I would like people looking for the error This issue can be closed since it was tracking a different problem (missing edge cases that I already addressed). |
I tried opening:
Note all the above have permissive licenses so it may be possible to borrow these tars for our test assets.
I used the test code below to open each, ignored those that opened successfully, and for those that failed compared whether some other tools could open them. The interesting cases are where other tools (particularly GNU tar) can open them, but we cannot. Note: I mostly didn't extract the entries, just checked they could be listed. In some cases, the tar can be listed, but extraction will fail.
test code I used
Possibly some of these are expected limitations, but for the others we should add checkboxes and work through and fix them.
The text was updated successfully, but these errors were encountered: