TarReader throws on various archives that other tools accept #74316

danmoseley · 2022-08-21T23:04:51Z

I tried opening:

each of the tar files used to test Golang's tar package (here with details about each in the tests here).
each of the tar files used to test node-tar, found here.
each of the tar files used to test libarchive, found here. Note I had to uudecode these.

Note all the above have permissive licenses so it may be possible to borrow these tars for our test assets.

I used the test code below to open each, ignored those that opened successfully, and for those that failed compared whether some other tools could open them. The interesting cases are where other tools (particularly GNU tar) can open them, but we cannot. Note: I mostly didn't extract the entries, just checked they could be listed. In some cases, the tar can be listed, but extraction will fail.

test code I used

// See https://aka.ms/new-console-template for more information
using System.Formats.Tar;
using Xunit;

public static class C
{

    public async static Task Main()
    {
        List<Task> tasks = new();
        foreach (string path in Directory.EnumerateFiles(@"C:\git\go\src\archive\tar\testdata", "*.tar"))
        {
            tasks.Add(Task.Run(async () =>
            {
                TarEntry? entry = null;

                try
                {
                    //Console.WriteLine($"{path} opening...");
                    using FileStream fs = new(path, FileMode.Open);
                    using TarReader reader = new(fs, leaveOpen: false);

                    while ((entry = await reader.GetNextEntryAsync()) != null)
                    {
                        var ms = new MemoryStream();

                        Assert.NotEmpty(entry.Name);
                        Assert.True(Enum.IsDefined(entry.EntryType));
                        Assert.True(Enum.IsDefined(entry.Format));

                        if (entry.EntryType == TarEntryType.Directory)
                            continue;

                        var ds = entry.DataStream;
                        if (ds != null && ds.Length > 0)
                        {
                            ds.CopyTo(ms);
                        }
                    }
                }
                catch (Exception ex) //when (!(ex is FormatException))
                {
                    Console.WriteLine($"{path} opening {entry?.Name} threw {ex.Message}");
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}

source	file	issue	gnu tar	7z	golang	.NET	.NET exception message
golang	gnu-multi-hdrs.tar	duplicate headers	reads one	reads one w/warning	reads one	ERROR	A metadata entry of type 'LongPath' was unexpectedly found after a metadata entry of type 'LongPath'.
golang	gnu-incremental.tar	incremental format	reads ok		reads ok	ERROR	Unable to read beyond the end of the stream.
golang	invalid-go17.tar	??	reads ok	reads ok	reads ok	ERROR	Could not find any recognizable digits.
golang	hdr-only.tar	just header	reads with errors	reads ok	reads ok	ERROR	Additional non-parsable characters are at the end of the string.
golang	nil-uid.tar	zero uid	reads ok	reads w/warnings	reads ok	ERROR	Unable to read beyond the end of the stream.
golang	pax-multi-hdrs.tar	2 headers	reads ok	reads w/warnings	reads ok	ERROR	A metadata entry of type 'ExtendedAttributes' was unexpectedly found after a metadata entry of type 'ExtendedAttributes'.
golang	pax-bad-mtime-file.tar	bad modified time	reads ok	reads w/warnings		ERROR	Unable to read beyond the end of the stream.
golang	pax-pos-size-file.tar	?	reads ok	reads w/warnings	reads ok	ERROR	Unable to read beyond the end of the stream.
golang	v7.tar	v7	reads ok	reads ok	reads ok	ERROR	Could not find any recognizable digits.
golang	sparse-formats.tar	something about sparseness	reads ok		reads ok	ERROR	Additional non-parsable characters are at the end of the string.
golang	ustar-file-reg.tar	non-zero device numbers.	reads ok		reads ok	ERROR	Unable to read beyond the end of the stream.
golang	writer-big.tar	truncated huge	ERROR		reads ok	ERROR	Could not find any recognizable digits.
golang	pax-path-hdr.tar	?	reads empty	ERROR	reads header	ERROR	Unable to read beyond the end of the stream.
golang	writer-big-long.tar	truncated huge	ERROR	reads w/ unexpected end of data	reads ok	ERROR	Unable to read beyond the end of the stream.
mine	huge.tar	dd if=/dev/zero bs=1G count=16 > huge.tar		reads		ERROR	Value was either too large or too small for a UInt32
golang	issue10968.tar	garbled header			ERROR	ERROR (but OK)	Could not find any recognizable digits.
golang	issue11169.tar	??			ERROR	ERROR (but OK)	Additional non-parsable characters are at the end of the string.
golang	neg-size.tar	negative size	ERROR	refuses	ERROR	ERROR (but OK)	Could not find any recognizable digits.
golang	pax-bad-hdr-file.tar	bad header	reads with errors	reads ok	ERROR	ERROR (but OK)	Unable to read beyond the end of the stream.
node	long-pax.tar	120 byte filename (pax limit 100)	reads headers	reads w/ unexpected end of data		ERROR	120-byte-filename-cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc threw Unable to read beyond the end of the stream.
node	next-file-has-long.tar	link to 170 byte name in GNU				ERROR	Entry 'NextFileHasLongPath' was expected to be in the GNU format, but did not have the expected version data.
node	path-missing.tar	empty name	"Substituting `.' for empty member name" (but not clear this is useful..)	silently uses tar file name		ERROR on extraction	Cannot create 'c:\tar' because a file or directory with the same name already exists (NOTE -- we should probably fix to fail earlier, in GetDestinationAndLinkPaths())
node	links-strip.tar	?symlink and hardlinks	reads ok	reads w/ unexpected end of data		ERROR	Unable to read beyond the end of the stream.
mine	empty.tar	0 bytes	reads OK	reads ok		OK
libarchive	test_compat_gtar_2.tar	huge gid	reads OK	reads ok		ERROR	Could not find any recognizable digits.
libarchive	test_compat_perl_archive_tar.tar	?	reads OK	reads ok		ERROR	Could not find any recognizable digits.
libarchive	test_compat_gtar_1.tar	200 byte filenames and symlink?	reads OK	reads ok		ERROR	Could not find any recognizable digits.
libarchive	test_compat_plexus_archiver_tar.tar		reads OK w/tar: A lone zero block at 3	reads w/ There are some data after the end of the payload data		ERROR	Could not find any recognizable digits.
libarchive	test_compat_solaris_tar_acl.tar		reads OK w/Unknown file type ‘A’	reads ok		OK	(no exception, but unexpected TarEntryType 65 = 'A' .. A custom extension)
libarchive	test_compat_tar_hardlink_1.tar		reads OK	reads w/ unexpected end of data		ERROR	Could not find any recognizable digits.
libarchive	test_read_format_gtar_sparse_1_17_posix00.tar		reads OK	reads ok		ERROR	The entry './PaxHeaders.38659/sparse' has a duplicate extended attribute.
libarchive	test_read_format_tar_invalid_pax_size.tar		ERRORS	ERROR		ERROR	Could not find any recognizable digits.

Possibly some of these are expected limitations, but for the others we should add checkboxes and work through and fix them.

dotnet-issue-labeler · 2022-08-21T23:04:54Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2022-08-21T23:52:21Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

danmoseley · 2022-08-22T01:05:57Z

Aside, I noticed this resource string does not have a format position, so it drops the path:

            if (!File.Exists(sourceFileName))
            {
                throw new FileNotFoundException(string.Format(SR.IO_FileNotFound, sourceFileName));
            }

another example

            if (!string.IsNullOrEmpty(directoryPath) && !Path.Exists(directoryPath))
            {
                throw new IOException(string.Format(SR.IO_PathNotFound_NoPathName, filePath));
            }

MichalPetryka · 2022-08-22T01:11:50Z

What do the empty cells in the table mean?

MichalPetryka · 2022-08-22T01:30:39Z

Also, as @jaredpar mentioned in an old blogpost using File.Exists before opening the file is unreliable due to FS changes from other sources.

danmoseley · 2022-08-22T01:55:57Z

@MichalPetryka it means I didn't try it.

using File.Exists before opening the file is unreliable due to FS changes from other sources.

Jared's point is something like: even if you check File.Exists, you still need to handle the possibility of the file not existing a moment later when you try to read it. In this case, if that happened, an exception would be thrown to the caller, which is fine. Yes, as a matter of style, or for maximum efficiency, we could catch FileNotFoundException instead, so long as the message was just as good.
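For illustration, here is a minimal sketch of the "open first, catch FileNotFoundException" alternative mentioned above. The SafeOpen class and OpenForRead method are hypothetical names, not code from System.Formats.Tar, and the message text is an assumption rather than the actual resource string.

```csharp
using System;
using System.IO;

public static class SafeOpen
{
    // Hypothetical helper: opens the file directly instead of calling
    // File.Exists first, so there is no window in which the file can
    // disappear between the check and the open.
    public static FileStream OpenForRead(string path)
    {
        try
        {
            return new FileStream(path, FileMode.Open, FileAccess.Read);
        }
        catch (FileNotFoundException ex)
        {
            // Rethrow with a message that names the path, so the error is
            // at least as informative as the File.Exists-based check.
            throw new FileNotFoundException($"Could not find file '{path}'.", path, ex);
        }
    }
}
```

The design trade-off is the one discussed above: one fewer file system call and no race, at the cost of relying on the exception path for the message.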

danmoseley · 2022-08-22T01:59:28Z

Added node and libarchive results. Thanks @am11 for pointing out all 3 of these. The other ones he found are GPL. We can probably still test that we can open those, we just can't copy them for our test assets, however rather than reason about that I'll leave them for now.

danmoseley · 2022-08-22T02:15:46Z

cc @carlossanlop

danmoseley · 2022-08-22T02:27:40Z

Here is another corpus under permissive license. I don't have time to run the code above on those too, but we should do that after we fix the bugs above.
https://github.com/alexcrichton/tar-rs/tree/master/tests/archives

am11 · 2022-08-22T02:30:57Z

https://github.com/alexcrichton/tar-rs/tree/master/tests/archives

One of them is failing like this:

/home/am11/projects/consoletar/tar-rs/tests/archives/spaces.tar ->

Unhandled Exception:
System.FormatException: Unable to parse number.
   at System.Formats.Tar.TarHelpers.ThrowInvalidNumber() in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs:line 225
   at System.Formats.Tar.TarHelpers.ParseOctal[UInt32](ReadOnlySpan`1 buffer) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs:line 214
   at System.Formats.Tar.TarHeader.TryReadCommonAttributes(Span`1 buffer, TarEntryFormat initialFormat) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 360
   at System.Formats.Tar.TarHeader.TryReadAttributes(TarEntryFormat initialFormat, Span`1 buffer) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 66
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData, TarEntryFormat initialFormat) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 30
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(Boolean copyData) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs:line 322
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs:line 124
   at Program.<<Main>$>g__ReadTar|0_0(String path, Boolean useGz) in /home/am11/projects/consoletar/Program.cs:line 27
   at Program.<Main>$(String[] args) in /home/am11/projects/consoletar/Program.cs:line 15

am11 · 2022-08-22T02:32:52Z

This delta fixes all issues with node-tar fixtures: main...am11:runtime:feature/system.formats.tar/hardlinks-support (working on tests).

danmoseley · 2022-08-22T02:35:50Z

Some in here (mostly compressed, would have to be decompressed, so perhaps not exercising tar much) https://github.com/adamhathcock/sharpcompress/tree/master/tests/TestArchives/Archives

danmoseley · 2022-08-22T02:41:12Z

A few from OpenBSD, uuencoded and BSD license - https://github.com/openbsd/src/tree/master/regress/bin/pax

am11 · 2022-08-22T02:43:14Z

For tar.gz and .tgz, we can use GZipStream in our runner. For .xz and .7z, we don't have OOTB support yet (#1542), so I think we can use a 3p library in tests.

I will add a few files from node-tar which failed to parse with current main (to avoid repetitions) in dotnet/runtime-assets repo, with TPNs.

danmoseley · 2022-08-22T02:45:28Z

Compression is a strict wrapper over tar, right? So if there are any useful tar's we want to test with (and license allows) that are tar.gz or whatever, we can simply uncompress them into .tar before putting them in the test assets?
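A minimal sketch of that idea, assuming the archive is gzip-compressed: strip the gzip layer with GZipStream and keep the plain tar bytes. TarGzHelper and DecompressGzip are hypothetical names used only for this example.

```csharp
using System.IO;
using System.IO.Compression;

public static class TarGzHelper
{
    // Copies the decompressed contents of a .tar.gz stream into destination,
    // so the result can be stored as a plain .tar file.
    // Both streams are left open for the caller to dispose.
    public static void DecompressGzip(Stream compressed, Stream destination)
    {
        using var gzip = new GZipStream(compressed, CompressionMode.Decompress, leaveOpen: true);
        gzip.CopyTo(destination);
    }
}
```

Typical use would be to pass a FileStream over the .tar.gz as the input and a newly created .tar FileStream as the destination.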

danmoseley · 2022-08-22T02:47:15Z

BTW, as mentioned above I largely ignored tar files that successfully opened, although these may suggest missing positive cases. That's why I put empty.tar in the list above, although I now see we do have a test for that (EmptyArchive()). At an extreme, we could grab all of them that the license allows and put them in the test assets to ensure that we don't regress opening them.

carlossanlop · 2022-08-22T04:25:07Z

Compression is a strict wrapper over tar, right? So if there are any useful tar's we want to test with (and license allows) that are tar.gz or whatever, we can simply uncompress them into .tar before putting them in the test assets?

Agreed. Additionally, the compression method used isn't too important. What really matters, if we want to emulate a compressed archive, is to make the archive stream unseekable, and we can achieve this by wrapping the main stream with a WrappedStream constructed with CanSeek=false, and pass that to the TarReader.
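As a rough sketch of the unseekable-stream idea, the wrapper below forwards reads to an inner stream but reports CanSeek as false. NonSeekableStream is a hypothetical stand-in written for this example; it is not the WrappedStream test helper referred to above, whose constructor is not shown in this thread.

```csharp
using System;
using System.IO;

// Hypothetical read-only wrapper that hides the inner stream's seekability,
// which is how a decompression stream looks to a consumer.
public sealed class NonSeekableStream : Stream
{
    private readonly Stream _inner;

    public NonSeekableStream(Stream inner) => _inner = inner;

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => false;

    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count) =>
        _inner.Read(buffer, offset, count);

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

A test would then construct the reader as new TarReader(new NonSeekableStream(File.OpenRead(path))) to exercise the non-seekable code paths without any real compression.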

danmoseley · 2022-08-22T04:27:59Z

I ran the same test over the files in https://github.com/alexcrichton/tar-rs/tree/master/tests/archives and got 3 failures:

7z_long_path.tar -- Entry '././@LongLink' was expected to be in the GNU format, but did not have the expected version data.
empty_filename.tar -- Unable to read beyond the end of the stream.
spaces.tar -- Could not find any recognizable digits.

empty_filename looks like it's already in the table, not sure about the other two. I'll add all of those (not just the failures) to runtime-assets as well.

carlossanlop · 2022-08-22T04:36:32Z

This delta fixes all issues with node-tar fixtures: main...am11:runtime:feature/system.formats.tar/hardlinks-support (working on tests).

This is great, @am11, thank you so much for working on the fix! Ping me when a PR is out, we would want to get this backported to rc1.

@danmoseley I noticed a few lines in that diff (SequenceEquals calls) would depend on the recent perf improvements PR that has pending approval for backporting to rc1.

Mentioning this in case we can consider this fix's dependency as an additional argument in favor of backporting the perf PR. Otherwise, this fix would have to be adjusted for the backport.

am11 · 2022-08-24T16:29:28Z

FWIW, I have tested libarchive and GNU tar(1) and they both extract those five tar files successfully. GNU tar warns about the last one (but still extracts the contents successfully):

test_compat_plexus_archiver_tar.tar ->
tar: A lone zero block at 3

I have not tested those files with rust and node, but based on the previous conversation, if the majority of tools are OK with these misalignments, we should tolerate them as well.

MichalPetryka · 2022-08-24T16:42:07Z

And 7zip:
test_read_format_gtar_sparse_1_17_posix00.tar - is fine
test_compat_plexus_archiver_tar.tar - warning
test_compat_tar_hardlink_1.tar - unexpected end of data
test_read_format_gtar_sparse_1_13.tar - cannot open as a tar archive
test_read_format_gtar_sparse_1_17.tar - header error

carlossanlop · 2022-08-24T17:09:54Z

@MichalPetryka can you share the exception messages?

MichalPetryka · 2022-08-24T17:11:59Z

@MichalPetryka can you share the exception messages?

carlossanlop · 2022-08-24T17:13:39Z

Oh apologies. I missed that you said 7-zip. I thought those were errors on the .NET APIs. Thanks for sharing.

danmoseley · 2022-08-25T22:34:59Z

@carlossanlop do you feel we must do more work here for 7.0? It seems to me we are in a reasonable place based on the data we have. These test tars are synthetic, after all. I suggest moving this to Future. We should certainly pick up the libarchive tars if they let us know it's OK.

carlossanlop · 2022-08-25T22:37:43Z

No more work for 7.0 in my opinion, @danmoseley. Thanks for moving the milestone.

stephentoub · 2022-08-27T13:38:31Z

How confident are we in the opposite direction, that tars produced by TarWriter are consumable by all commonly-used tools? Do we have tests for that direction, e.g. generate various outputs with TarWriter, shell out to tar to unpack, and compare that everything roundtripped as expected?
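A minimal in-process sketch of such a round-trip check is below. It only covers the "write with TarWriter" half and then reads the archive back with TarReader; in a real interop test the read-back step would be replaced by shelling out to an external tar, which is not done here because tool availability is an assumption. TarRoundTrip and its methods are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;
using System.Text;

public static class TarRoundTrip
{
    // Writes a single regular-file entry in PAX format and returns the archive bytes.
    public static byte[] WriteSingleEntry(string name, string content)
    {
        using var archive = new MemoryStream();
        using (var writer = new TarWriter(archive, TarEntryFormat.Pax, leaveOpen: true))
        {
            var entry = new PaxTarEntry(TarEntryType.RegularFile, name)
            {
                DataStream = new MemoryStream(Encoding.UTF8.GetBytes(content))
            };
            writer.WriteEntry(entry);
        } // disposing the writer finishes the archive
        return archive.ToArray();
    }

    // Lists entry names; stands in for "tar -tf" in this sketch.
    public static List<string> ReadNames(byte[] archiveBytes)
    {
        var names = new List<string>();
        using var reader = new TarReader(new MemoryStream(archiveBytes));
        TarEntry? entry;
        while ((entry = reader.GetNextEntry()) != null)
        {
            names.Add(entry.Name);
        }
        return names;
    }
}
```

To test the direction asked about here, the bytes from WriteSingleEntry would be saved to disk and handed to each external tool, comparing the listed names and extracted contents against what was written.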

mthalman · 2022-11-11T22:07:36Z

FWIW, I've run into a real-world scenario of a tarball that can't be read with .NET runtime 7.0.0. I'm attempting to read the tarball for a Fedora container image layer. It fails with this callstack:

System.IO.InvalidDataException
  HResult=0x80131501
  Message=Unable to parse number.
  Source=System.Formats.Tar
  StackTrace:
   at System.Formats.Tar.TarHelpers.ThrowInvalidNumber() in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs:line 233
   at System.Formats.Tar.TarHelpers.ParseOctal[T](ReadOnlySpan`1 buffer) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs:line 225
   at System.Formats.Tar.TarHeader.TryReadCommonAttributes(Span`1 buffer, TarEntryFormat initialFormat) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 372
   at System.Formats.Tar.TarHeader.TryReadAttributes(TarEntryFormat initialFormat, Span`1 buffer) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 66
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData, TarEntryFormat initialFormat, Boolean processDataBlock) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 30
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(Boolean copyData) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs:line 313
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs:line 123
   at Program.<<Main>$>d__0.MoveNext() in C:\Users\mthalman\source\repos\ConsoleApp4\ConsoleApp4\Program.cs:line 10
   at Program.<Main>(String[] args)

Here's my repro:

using System.Formats.Tar;

var client = new HttpClient();
var message = await client.GetAsync(
    "https://registry.fedoraproject.org/v2/fedora/blobs/sha256:7a05f01240abe225dc6c0178dd0fa67874478dadaba59f33efde33bcfb242d93");
var layerStream = await message.Content.ReadAsStreamAsync();

using var layerReader = new TarReader(layerStream);
while (true)
{
    var layerEntry = layerReader.GetNextEntry();
    if (layerEntry is null)
    {
        return;
    }
}

am11 · 2022-11-11T23:34:58Z

FWIW, I've run into a real-world scenario of a tarball that can't be read with .NET runtime 7.0.0. I'm attempting to read the tarball for a Fedora container image layer. It fails with this callstack:

We had the following patch in #74358, which was matching BSD & GNU tar(1) as well as libarchive's behavior: ignore non-octal bytes when reading the attributes

--- a/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
+++ b/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
@@ -221,6 +221,11 @@ internal static TarEntryType GetCorrectTypeFlagForFormat(TarEntryFormat format,
             buffer = TrimEndingNullsAndSpaces(buffer);
             buffer = TrimLeadingNullsAndSpaces(buffer);
 
+            // skip leading non-octal bytes
+            int offset = 0;
+            for (; offset < buffer.Length && (buffer[offset] < (byte)'0' || buffer[offset] > (byte)'7'); ++offset);
+            buffer = buffer.Slice(offset);
+
             if (buffer.Length == 0)
             {
                 return T.Zero;

it was rejected because it seemed too permissive (going by the standard; which suggests to only ignore 0 and 32 from ASCII table; Fedora image has ACK (ASCII 06) at index 0...)

adamsitnik · 2022-11-14T12:06:36Z

I've run into a real-world scenario of a tarball that can't be read with .NET runtime 7.0.0
it was rejected because it seemed too permissive

@carlossanlop @jozkee @jeffhandley should we reconsider our approach?

bartonjs · 2022-11-14T18:59:51Z

Not without understanding why the Fedora image has an ACK there. Otherwise, the correct thing to patch would be whatever generated that tar, not .NET.

mthalman · 2022-11-14T19:02:52Z

@omajid may be of assistance here in tracking down what tool is used to produce the tarball. I saw the same issue with CentOS, which is in the same distro family as Fedora.

@omajid - see my post here: #74316 (comment)

omajid · 2022-11-15T15:59:17Z

From what I can tell, what Fedora does goes like this:

Create a VM
Install packages into it
Get the VM's disk image
Mount disk image locally, and then run tar cf on it
Write docker metadata to tarball using python's tarfile.TarFile class
Recompress the tarball using xz directly
Save the image and then later push it to dockerhub

Here's a build log that shows all the steps except the last one: https://kojipkgs.fedoraproject.org//packages/Fedora-Container-Base/37/20221110.n.0/data/logs/image/oz-x86_64.log

The underlying tool seems to be https://github.com/redhat-imaging/imagefactory. I can't see much more because the website is dead: http://imgfac.org/

I am not terribly familiar with this part of Fedora, so I might have some details wrong.

am11 · 2022-11-16T18:57:29Z

Otherwise, the correct thing to patch would be whatever generated that tar, not .NET.

This doc suggests that there is also a constraint on the length of octal digits (six, seven, or eleven, depending on the format, on whether the implementation is pre- or post-POSIX 1988 standard, and on the platform). We don't check the length at all, so we are lenient as-is. Following what libarchive does for octal fields (find the first octal digit in the header, then read until the first non-octal digit or the end of the buffer) would not set a new precedent. Our TarWriter produces clean headers (octal followed by null), so the reader can be permissive.
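The libarchive-style rule described above can be sketched as follows. This is an illustration of the parsing rule only, not the code in TarHelpers.ParseOctal and not libarchive's actual implementation; LenientOctal is a hypothetical name.

```csharp
using System;

public static class LenientOctal
{
    // Skips to the first octal digit, then accumulates digits until the
    // first non-octal byte or the end of the field.
    // Returns 0 when the field contains no octal digit at all.
    public static long Parse(ReadOnlySpan<byte> field)
    {
        int i = 0;
        while (i < field.Length && !IsOctalDigit(field[i]))
        {
            i++;
        }

        long value = 0;
        for (; i < field.Length && IsOctalDigit(field[i]); i++)
        {
            // checked: an overlong field overflows loudly instead of wrapping.
            value = checked(value * 8 + (field[i] - (byte)'0'));
        }
        return value;
    }

    private static bool IsOctalDigit(byte b) => b >= (byte)'0' && b <= (byte)'7';
}
```

With this rule, a field that starts with an ACK byte (0x06) followed by "0000644" and a null parses as octal 644, which is the Fedora case discussed above.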

AngelMunoz · 2023-05-19T06:31:11Z

Hey @mthalman, I just ran into the "Unable to parse number" issue as well, using .NET's HTTP client.

Apparently the stream is being gzipped twice. I got a clue from here: icsharpcode/SharpZipLib#514

In my case I was doing a similar thing:

task {
  use! str = http.GetStreamAsync url
  // System.IO.InvalidDataException: Unable To Parse Number
  TarFile.ExtractToDirectory(str, output, false)
}

Wrapping the stream from the http client made it work

task {
  use! str = http.GetStreamAsync url
  // It works
  use outer = new GZipStream(str, CompressionMode.Decompress)
  TarFile.ExtractToDirectory(outer, output, false)
}

It might be worth checking if it is a similar issue :)

Edit:

I'm not sure if this was a particular case of .NET's HttpClient or a double gzip from node themselves. I just know wrapping it made it work 😅

danmoseley · 2023-05-19T14:19:54Z

If that's common, I could imagine Tar giving a hint in the error, based on the Gzip magic number.

NickeManarin · 2023-07-16T05:22:15Z

Another real-life example of the "Unable to parse number" exception:

https://github.com/ImageOptim/gifski/releases/download/1.11.0/gifski-1.11.0.tar.xz

   at System.Formats.Tar.TarHelpers.ThrowInvalidNumber()
   at System.Formats.Tar.TarHelpers.ParseOctal[T](ReadOnlySpan`1 buffer)
   at System.Formats.Tar.TarHeader.TryReadCommonAttributes(Span`1 buffer, TarEntryFormat initialFormat)
   at System.Formats.Tar.TarHeader.TryReadAttributes(TarEntryFormat initialFormat, Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData, TarEntryFormat initialFormat, Boolean processDataBlock)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at System.Formats.Tar.TarFile.ExtractToDirectoryInternal(Stream source, String destinationDirectoryPath, Boolean overwriteFiles, Boolean leaveOpen)

Wrapping in a GZipStream doesn't help.

carlossanlop · 2023-07-16T17:55:41Z

@NickeManarin the *.xz extension means the archive was compressed using the LZMA algorithm: https://www.tutorialspoint.com/using-xz-compression-in-linux#:~:text=The%20xz%20compression%20algorithm%20works,block%20independently%20using%20LZMA%20algorithm.

We don't yet support the LZMA algorithm in System.IO.Compression, but we do have an issue tracking the request to eventually add it: #1542

A very easy workaround is to import SharpCompress to read the LZMA part of your archive, and then pass it to the System.Formats.Tar.TarFile stream-based extraction method. I tested it and it works:

using SharpCompress.Compressors.Xz;
using System.Formats.Tar;
using System.IO;

class CSharpTestClass
{
    static void Main()
    {
        string tarXzArchivePath = @"D:\Downloads\gifski-1.11.0.tar.xz";
        string destinationDirectoryPath = @"D:\Downloads\extractedxz";

        if (!Directory.Exists(destinationDirectoryPath))
        {
            Directory.CreateDirectory(destinationDirectoryPath);
        }

        using FileStream file = File.Open(tarXzArchivePath, FileMode.Open);
        using XZStream xzStream = new(file);
        TarFile.ExtractToDirectory(xzStream, destinationDirectoryPath, overwriteFiles: false);
    }
}

Hope that helps!

NickeManarin · 2023-07-17T02:47:37Z

@carlossanlop Thanks, that worked!

danmoseley · 2023-07-17T02:55:25Z

Would we accept a change that, on failure to read the archive, included the compression format in the message (by looking at magic numbers, presumably)? That might help in cases like this. It could just be best effort.
Hmm, maybe we already decided we didn't want to, I can't remember. But IIRC others have had this kind of confusion.

carlossanlop · 2023-07-17T22:29:23Z

I agree, we could add logic to TarReader to detect a compressed archive by reading the magic numbers. I opened issue #89056 to track that request specifically. I would like people looking for the error "Unable to parse number" to get directed there.

This issue can be closed since it was tracking a different problem (missing edge cases that I already addressed).
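For illustration, a best-effort magic-number check might look like the sketch below. It is not the design tracked in #89056; CompressionSniffer is a hypothetical name, and the list of signatures (gzip, xz, bzip2, zstd) is an assumption about which formats are worth hinting at.

```csharp
using System;

public static class CompressionSniffer
{
    // Well-known leading signatures of common compressed formats.
    private static ReadOnlySpan<byte> Gzip => new byte[] { 0x1F, 0x8B };
    private static ReadOnlySpan<byte> Xz => new byte[] { 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00 };
    private static ReadOnlySpan<byte> Bzip2 => new byte[] { 0x42, 0x5A, 0x68 }; // "BZh"
    private static ReadOnlySpan<byte> Zstd => new byte[] { 0x28, 0xB5, 0x2F, 0xFD };

    // Returns a format name when the buffer starts with a known signature,
    // or null when nothing matches. Intended only to improve an error message.
    public static string? Detect(ReadOnlySpan<byte> header)
    {
        if (header.StartsWith(Gzip)) return "gzip";
        if (header.StartsWith(Xz)) return "xz";
        if (header.StartsWith(Bzip2)) return "bzip2";
        if (header.StartsWith(Zstd)) return "zstd";
        return null;
    }
}
```

A reader could call Detect on the first bytes of the first 512-byte block when header parsing fails and append a hint such as "the stream appears to be xz-compressed" to the exception message.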

ghost added the untriaged New issue has not been triaged by the area owner label Aug 21, 2022

danmoseley mentioned this issue Aug 21, 2022

TarReader fails to read tar file with hardlinks; throws System.IO.EndOfStreamException #74309

Closed

am11 added the area-System.IO label Aug 21, 2022

danmoseley mentioned this issue Aug 22, 2022

Fix tar string formatting with more parameters than placeholders #74321

Merged

am11 mentioned this issue Aug 22, 2022

Port a few fixtures from node-tar dotnet/runtime-assets#263

Merged

danmoseley mentioned this issue Aug 22, 2022

Port fixtures from golang tar dotnet/runtime-assets#264

Merged

carlossanlop assigned carlossanlop and am11 Aug 22, 2022

carlossanlop removed the untriaged New issue has not been triaged by the area owner label Aug 22, 2022

carlossanlop added this to the 7.0.0 milestone Aug 22, 2022

danmoseley mentioned this issue Aug 22, 2022

Add test tar files from tar-rs dotnet/runtime-assets#265

Merged

danmoseley modified the milestones: 7.0.0, Future Aug 25, 2022

jozkee mentioned this issue Sep 2, 2022

Tar: Fix buffer is too small error when writing Prefix on large filenames #75023

Closed

jeffhandley added area-System.Formats.Tar and removed area-System.IO labels Nov 21, 2022

carlossanlop mentioned this issue Jul 17, 2023

Tar: Detect magic numbers in archives compressed with popular algorithms #89056

Open

carlossanlop closed this as completed Jul 17, 2023

ghost locked as resolved and limited conversation to collaborators Aug 17, 2023
