TarReader throws on various archives that other tools accept #74316

Closed
danmoseley opened this issue Aug 21, 2022 · 60 comments
@danmoseley
Member

danmoseley commented Aug 21, 2022

I tried opening:

  1. each of the tar files used to test Golang's tar package (here with details about each in the tests here).
  2. each of the tar files used to test node-tar, found here.
  3. each of the tar files used to test libarchive, found here. Note I had to uudecode these.

Note all the above have permissive licenses so it may be possible to borrow these tars for our test assets.

I used the test code below to open each, ignored those that opened successfully, and for those that failed compared whether some other tools could open them. The interesting cases are where other tools (particularly GNU tar) can open them, but we cannot. Note: I mostly didn't extract the entries, just checked they could be listed. In some cases, the tar can be listed, but extraction will fail.

test code I used
// See https://aka.ms/new-console-template for more information
using System;
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;
using System.Threading.Tasks;
using Xunit;

public static class C
{
    public static async Task Main()
    {
        // Open every archive in parallel; each task reports its own failure.
        List<Task> tasks = new();
        foreach (string path in Directory.EnumerateFiles(@"C:\git\go\src\archive\tar\testdata", "*.tar"))
        {
            tasks.Add(Task.Run(async () =>
            {
                // Track the last entry read so that a failure can name it.
                TarEntry? entry = null;

                try
                {
                    using FileStream fs = new(path, FileMode.Open, FileAccess.Read);
                    using TarReader reader = new(fs, leaveOpen: false);

                    while ((entry = await reader.GetNextEntryAsync()) != null)
                    {
                        // Basic sanity checks on the parsed header.
                        Assert.NotEmpty(entry.Name);
                        Assert.True(Enum.IsDefined(entry.EntryType));
                        Assert.True(Enum.IsDefined(entry.Format));

                        if (entry.EntryType == TarEntryType.Directory)
                            continue;

                        // Drain the entry's data (if any) to exercise the data stream.
                        Stream? ds = entry.DataStream;
                        if (ds != null && ds.Length > 0)
                        {
                            using var ms = new MemoryStream();
                            await ds.CopyToAsync(ms);
                        }
                    }
                }
                catch (Exception ex) //when (!(ex is FormatException))
                {
                    Console.WriteLine($"{path} opening {entry?.Name} threw {ex.Message}");
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}
 
| Source | File | Issue | Other tools tried (GNU tar; 7z; golang) | .NET | .NET exception |
|---|---|---|---|---|---|
| golang | gnu-multi-hdrs.tar | duplicate headers | reads one; reads one w/warning; reads one | ERROR | A metadata entry of type 'LongPath' was unexpectedly found after a metadata entry of type 'LongPath'. |
| golang | gnu-incremental.tar | incremental format | reads ok; reads ok | ERROR | Unable to read beyond the end of the stream. |
| golang | invalid-go17.tar | ?? | reads ok; reads ok; reads ok | ERROR | Could not find any recognizable digits. |
| golang | hdr-only.tar | just header | reads with errors; reads ok; reads ok | ERROR | Additional non-parsable characters are at the end of the string. |
| golang | nil-uid.tar | zero uid | reads ok; reads w/warnings; reads ok | ERROR | Unable to read beyond the end of the stream. |
| golang | pax-multi-hdrs.tar | 2 headers | reads ok; reads w/warnings; reads ok | ERROR | A metadata entry of type 'ExtendedAttributes' was unexpectedly found after a metadata entry of type 'ExtendedAttributes'. |
| golang | pax-bad-mtime-file.tar | bad modified time | reads ok; reads w/warnings | ERROR | Unable to read beyond the end of the stream. |
| golang | pax-pos-size-file.tar | ? | reads ok; reads w/warnings; reads ok | ERROR | Unable to read beyond the end of the stream. |
| golang | v7.tar | v7 | reads ok; reads ok; reads ok | ERROR | Could not find any recognizable digits. |
| golang | sparse-formats.tar | something about sparseness | reads ok; reads ok | ERROR | Additional non-parsable characters are at the end of the string. |
| golang | ustar-file-reg.tar | non-zero device numbers. | reads ok; reads ok | ERROR | Unable to read beyond the end of the stream. |
| golang | writer-big.tar | truncated huge | ERROR; reads ok | ERROR | Could not find any recognizable digits. |
| golang | pax-path-hdr.tar | ? | reads empty; ERROR; reads header | ERROR | Unable to read beyond the end of the stream. |
| golang | writer-big-long.tar | truncated huge | ERROR; reads w/ unexpected end of data; reads ok | ERROR | Unable to read beyond the end of the stream. |
| mine | huge.tar | dd if=/dev/zero bs=1G count=16 > huge.tar | reads | ERROR | Value was either too large or too small for a UInt32 |
| golang | issue10968.tar | garbled header | ERROR | ERROR (but OK) | Could not find any recognizable digits. |
| golang | issue11169.tar | ?? | ERROR | ERROR (but OK) | Additional non-parsable characters are at the end of the string. |
| golang | neg-size.tar | negative size | ERROR; refuses; ERROR | ERROR (but OK) | Could not find any recognizable digits. |
| golang | pax-bad-hdr-file.tar | bad header | reads with errors; reads ok; ERROR | ERROR (but OK) | Unable to read beyond the end of the stream. |
| node | long-pax.tar | 120 byte filename (pax limit 100) | reads headers; reads w/ unexpected end of data | ERROR | 120-byte-filename-cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc threw Unable to read beyond the end of the stream. |
| node | next-file-has-long.tar | link to 170 byte name in GNU | | ERROR | Entry 'NextFileHasLongPath' was expected to be in the GNU format, but did not have the expected version data. |
| node | path-missing.tar | empty name | "Substituting `.' for empty member name" (but not clear this is useful..); silently uses tar file name | ERROR on extraction | Cannot create 'c:\tar' because a file or directory with the same name already exists (NOTE -- we should probably fix to fail earlier, in GetDestinationAndLinkPaths()) |
| node | links-strip.tar | ?symlink and hardlinks | reads ok; reads w/ unexpected end of data | ERROR | Unable to read beyond the end of the stream. |
| mine | empty.tar | 0 bytes | reads OK; reads ok | OK | |
| libarchive | test_compat_gtar_2.tar | huge gid | reads OK; reads ok | ERROR | Could not find any recognizable digits. |
| libarchive | test_compat_perl_archive_tar.tar | ? | reads OK; reads ok | ERROR | Could not find any recognizable digits. |
| libarchive | test_compat_gtar_1.tar | 200 byte filenames and symlink? | reads OK; reads ok | ERROR | Could not find any recognizable digits. |
| libarchive | test_compat_plexus_archiver_tar.tar | | reads OK w/tar: A lone zero block at 3; reads w/ There are some data after the end of the payload data | ERROR | Could not find any recognizable digits. |
| libarchive | test_compat_solaris_tar_acl.tar | | reads OK w/Unknown file type ‘A’; reads ok | OK | (no exception, but unexpected TarEntryType 65 = 'A' .. A custom extension) |
| libarchive | test_compat_tar_hardlink_1.tar | | reads OK; reads w/ unexpected end of data | ERROR | Could not find any recognizable digits. |
| libarchive | test_read_format_gtar_sparse_1_17_posix00.tar | | reads OK; reads ok | ERROR | The entry './PaxHeaders.38659/sparse' has a duplicate extended attribute. |
| libarchive | test_read_format_tar_invalid_pax_size.tar | | ERRORS; ERROR | ERROR | Could not find any recognizable digits. |

Possibly some of these are expected limitations, but for the others we should add checkboxes and work through and fix them.

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the untriaged label (New issue has not been triaged by the area owner) Aug 21, 2022
@ghost

ghost commented Aug 21, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.


@danmoseley
Member Author

danmoseley commented Aug 22, 2022

Aside, I noticed this resource string does not have a format position, so it drops the path:

            if (!File.Exists(sourceFileName))
            {
                throw new FileNotFoundException(string.Format(SR.IO_FileNotFound, sourceFileName));
            }

another example

            if (!string.IsNullOrEmpty(directoryPath) && !Path.Exists(directoryPath))
            {
                throw new IOException(string.Format(SR.IO_PathNotFound_NoPathName, filePath));
            }
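
To illustrate the point about the missing format position: string.Format silently ignores arguments that the format string never references, so the path is lost without any error. The sketch below uses made-up message strings as stand-ins; the actual SR.IO_FileNotFound and SR.IO_PathNotFound_NoPathName values are not shown in this thread and may differ.

```csharp
using System;

public static class FormatPositionDemo
{
    // Hypothetical stand-ins for resource strings; the real SR.* values may differ.
    public const string WithoutPlaceholder = "Could not find the specified file.";
    public const string WithPlaceholder = "Could not find file '{0}'.";

    // Mirrors the string.Format(SR.Something, path) pattern from the snippets above.
    public static string Describe(string format, string path) => string.Format(format, path);
}
```

Calling Describe with WithoutPlaceholder returns the message unchanged, while WithPlaceholder includes the path.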

@MichalPetryka
Contributor

What do the empty cells in the table mean?

@MichalPetryka
Contributor

Also, as @jaredpar mentioned in an old blog post, using File.Exists before opening the file is unreliable due to FS changes from other sources.

@danmoseley
Member Author

@MichalPetryka it means I didn't try it.

> using File.Exists before opening the file is unreliable due to FS changes from other sources.

Jared's point is something like: even if you check File.Exists, you still need to handle the possibility of the file not existing a moment later when you try to read it. In this case, if such a thing happened, an exception would be thrown to the caller, which is fine. Yes, as a matter of style, or for maximum efficiency, we could catch FileNotFoundException instead, so long as the message was just as good.
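
A minimal sketch of the "catch instead of pre-check" style mentioned here, not the code in System.Formats.Tar. The helper name and the message text are made up for illustration.

```csharp
#nullable enable
using System;
using System.IO;

public static class OpenWithoutPrecheck
{
    // Opens the file directly and reports a missing file via the exception,
    // instead of calling File.Exists first (which can race with other processes).
    public static bool TryOpenForRead(string path, out FileStream? stream, out string? error)
    {
        try
        {
            stream = new FileStream(path, FileMode.Open, FileAccess.Read);
            error = null;
            return true;
        }
        catch (Exception ex) when (ex is FileNotFoundException or DirectoryNotFoundException)
        {
            stream = null;
            // Hypothetical message; the point is that it still names the path.
            error = $"Could not find file '{path}'.";
            return false;
        }
    }
}
```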

@danmoseley
Member Author

danmoseley commented Aug 22, 2022

Added node and libarchive results. Thanks @am11 for pointing out all 3 of these. The other ones he found are GPL. We can probably still test that we can open those, we just can't copy them for our test assets, however rather than reason about that I'll leave them for now.

@danmoseley
Member Author

cc @carlossanlop

@danmoseley
Member Author

Here is another corpus under permissive license. I don't have time to run the code above on those too, but we should do that after we fix the bugs above.
https://github.com/alexcrichton/tar-rs/tree/master/tests/archives

@am11
Member

am11 commented Aug 22, 2022

> https://github.com/alexcrichton/tar-rs/tree/master/tests/archives

One of them is failing like this:

/home/am11/projects/consoletar/tar-rs/tests/archives/spaces.tar ->

Unhandled Exception:
System.FormatException: Unable to parse number.
   at System.Formats.Tar.TarHelpers.ThrowInvalidNumber() in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs:line 225
   at System.Formats.Tar.TarHelpers.ParseOctal[UInt32](ReadOnlySpan`1 buffer) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs:line 214
   at System.Formats.Tar.TarHeader.TryReadCommonAttributes(Span`1 buffer, TarEntryFormat initialFormat) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 360
   at System.Formats.Tar.TarHeader.TryReadAttributes(TarEntryFormat initialFormat, Span`1 buffer) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 66
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData, TarEntryFormat initialFormat) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 30
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(Boolean copyData) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs:line 322
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData) in /runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs:line 124
   at Program.<<Main>$>g__ReadTar|0_0(String path, Boolean useGz) in /home/am11/projects/consoletar/Program.cs:line 27
   at Program.<Main>$(String[] args) in /home/am11/projects/consoletar/Program.cs:line 15

@am11
Member

am11 commented Aug 22, 2022

This delta fixes all issues with node-tar fixtures: main...am11:runtime:feature/system.formats.tar/hardlinks-support (working on tests).

@danmoseley
Member Author

Some in here (mostly compressed, would have to be decompressed, so perhaps not exercising tar much) https://github.com/adamhathcock/sharpcompress/tree/master/tests/TestArchives/Archives

@danmoseley
Member Author

A few from OpenBSD, uuencoded and BSD license - https://github.com/openbsd/src/tree/master/regress/bin/pax

@am11
Member

am11 commented Aug 22, 2022

For tar.gz and .tgz, we can use GZipStream in our runner. For .xz and .7z, we don't have OOTB support yet (#1542), so I think we can use a 3p library in tests.
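
As a rough sketch of what such a runner could do for .tar.gz and .tgz inputs (the class and method names here are made up, not part of any existing test runner): wrap the compressed stream in a GZipStream and hand that to TarReader.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;
using System.IO.Compression;

public static class TarGzListing
{
    // Lists the entry names of a gzip-compressed tar stream.
    public static List<string> ListEntries(Stream tarGz)
    {
        var names = new List<string>();
        using var gzip = new GZipStream(tarGz, CompressionMode.Decompress, leaveOpen: true);
        using var reader = new TarReader(gzip, leaveOpen: true);

        TarEntry? entry;
        while ((entry = reader.GetNextEntry()) is not null)
        {
            names.Add(entry.Name);
        }
        return names;
    }
}
```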

I will add a few files from node-tar which failed to parse with current main (to avoid repetitions) in dotnet/runtime-assets repo, with TPNs.

@danmoseley
Member Author

Compression is a strict wrapper over tar, right? So if there are any useful tar's we want to test with (and license allows) that are tar.gz or whatever, we can simply uncompress them into .tar before putting them in the test assets?

@danmoseley
Member Author

BTW, as mentioned above, I largely ignored tar files that opened successfully, although these may suggest missing positive cases. That's why I put empty.tar in the list above, although I now see we do have a test for that (EmptyArchive()). At an extreme, we could grab all of them that the license allows and put them in the test assets to ensure that we don't regress opening them.

@carlossanlop
Member

> Compression is a strict wrapper over tar, right? So if there are any useful tar's we want to test with (and license allows) that are tar.gz or whatever, we can simply uncompress them into .tar before putting them in the test assets?

Agreed. Additionally, the compression method used isn't too important. What really matters, if we want to emulate a compressed archive, is to make the archive stream unseekable, and we can achieve this by wrapping the main stream with a WrappedStream constructed with CanSeek=false, and pass that to the TarReader.
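
For readers without access to the test helper, here is a minimal stand-in for the idea. NonSeekableStream is a hypothetical class written for this illustration, not the WrappedStream type mentioned above; it only forwards reads and reports CanSeek as false.

```csharp
using System;
using System.IO;

// Hypothetical helper: a read-only wrapper that hides the inner stream's seekability.
public sealed class NonSeekableStream : Stream
{
    private readonly Stream _inner;

    public NonSeekableStream(Stream inner) => _inner = inner;

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count) => _inner.Read(buffer, offset, count);
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

Passing new NonSeekableStream(File.OpenRead(path)) to the TarReader constructor should then exercise the same unseekable code paths that a decompression stream would.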

@danmoseley
Member Author

I ran the same test over the files in https://github.com/alexcrichton/tar-rs/tree/master/tests/archives and got 3 failures:

7z_long_path.tar -- Entry '././@LongLink' was expected to be in the GNU format, but did not have the expected version data.
empty_filename.tar -- Unable to read beyond the end of the stream.
spaces.tar -- Could not find any recognizable digits.

empty_filename looks like it's already in the table, not sure about the other two. I'll add all of those (not just the failures) to runtime-assets as well.

@carlossanlop
Member

> This delta fixes all issues with node-tar fixtures: main...am11:runtime:feature/system.formats.tar/hardlinks-support (working on tests).

This is great, @am11, thank you so much for working on the fix! Ping me when a PR is out, we would want to get this backported to rc1.

@danmoseley I noticed a few lines in that diff (SequenceEquals calls) would depend on the recent perf improvements PR that has pending approval for backporting to rc1.

Mentioning this in case we can consider this fix's dependency as an additional argument in favor of backporting the perf PR. Otherwise, this fix would have to be adjusted as a backport.

@carlossanlop removed the untriaged label (New issue has not been triaged by the area owner) Aug 22, 2022
@carlossanlop added this to the 7.0.0 milestone Aug 22, 2022
@am11
Member

am11 commented Aug 24, 2022

FWIW, I have tested libarchive and GNU tar(1) and they both extract those five tar files successfully. GNU tar warns about the last one (but still extracts the contents successfully):

test_compat_plexus_archiver_tar.tar ->
tar: A lone zero block at 3

I have not tested those files with rust and node, but based on the previous conversation, if the majority of tools are OK with these misalignments, we should tolerate them as well.

@MichalPetryka
Contributor

And 7zip:
test_read_format_gtar_sparse_1_17_posix00.tar - is fine
test_compat_plexus_archiver_tar.tar - warning
test_compat_tar_hardlink_1.tar - unexpected end of data
test_read_format_gtar_sparse_1_13.tar - cannot open as a tar archive
test_read_format_gtar_sparse_1_17.tar - header error

@carlossanlop
Member

@MichalPetryka can you share the exception messages?

@MichalPetryka
Contributor

> @MichalPetryka can you share the exception messages?

image

@carlossanlop
Member

Oh apologies. I missed that you said 7-zip. I thought those were errors on the .NET APIs. Thanks for sharing.

@danmoseley
Member Author

@carlossanlop do you feel we must do more work here for 7.0? It seems to me we are in a reasonable place based on the data we have. These test tars are synthetic, after all. I suggest moving this to Future. We should certainly pick up the libarchive tars if they let us know it's OK.

@danmoseley modified the milestones: 7.0.0, Future Aug 25, 2022
@carlossanlop
Member

No more work for 7.0 in my opinion, @danmoseley. Thanks for moving the milestone.

@stephentoub
Member

How confident are we in the opposite direction, that tars produced by TarWriter are consumable by all commonly-used tools? Do we have tests for that direction, e.g. generate various outputs with TarWriter, shell out to tar to unpack, and compare that everything roundtripped as expected?
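
One possible shape for such a check, as a sketch only: write an archive with TarWriter, then ask an external tar to list it. The class and method names are invented for this example, and it assumes a tar executable is available on PATH; a real test would also extract and compare contents.

```csharp
using System;
using System.Diagnostics;
using System.Formats.Tar;
using System.IO;
using System.Text;

public static class TarInterop
{
    // Writes a single-file PAX archive with TarWriter.
    public static void WriteSample(string archivePath, string entryName, string content)
    {
        using FileStream fs = File.Create(archivePath);
        using var writer = new TarWriter(fs, TarEntryFormat.Pax, leaveOpen: false);
        var entry = new PaxTarEntry(TarEntryType.RegularFile, entryName)
        {
            DataStream = new MemoryStream(Encoding.UTF8.GetBytes(content))
        };
        writer.WriteEntry(entry);
    }

    // Runs "tar -tf <archive>" and returns its standard output.
    // Assumption: a tar executable is on PATH.
    public static string ListWithExternalTar(string archivePath)
    {
        var psi = new ProcessStartInfo("tar")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
        };
        psi.ArgumentList.Add("-tf");
        psi.ArgumentList.Add(archivePath);

        using Process process = Process.Start(psi)
            ?? throw new InvalidOperationException("Could not start tar.");
        string output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        if (process.ExitCode != 0)
        {
            throw new InvalidOperationException($"tar exited with code {process.ExitCode}.");
        }
        return output;
    }
}
```

A test could call WriteSample, then assert that the output of ListWithExternalTar contains the entry name.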

@mthalman
Member

FWIW, I've run into a real-world scenario of a tarball that can't be read with .NET runtime 7.0.0. I'm attempting to read the tarball for a Fedora container image layer. It fails with this callstack:

System.IO.InvalidDataException
  HResult=0x80131501
  Message=Unable to parse number.
  Source=System.Formats.Tar
  StackTrace:
   at System.Formats.Tar.TarHelpers.ThrowInvalidNumber() in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs:line 233
   at System.Formats.Tar.TarHelpers.ParseOctal[T](ReadOnlySpan`1 buffer) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs:line 225
   at System.Formats.Tar.TarHeader.TryReadCommonAttributes(Span`1 buffer, TarEntryFormat initialFormat) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 372
   at System.Formats.Tar.TarHeader.TryReadAttributes(TarEntryFormat initialFormat, Span`1 buffer) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 66
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData, TarEntryFormat initialFormat, Boolean processDataBlock) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs:line 30
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(Boolean copyData) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs:line 313
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData) in /_/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs:line 123
   at Program.<<Main>$>d__0.MoveNext() in C:\Users\mthalman\source\repos\ConsoleApp4\ConsoleApp4\Program.cs:line 10
   at Program.<Main>(String[] args)

Here's my repro:

using System.Formats.Tar;

var client = new HttpClient();
var message = await client.GetAsync(
    "https://registry.fedoraproject.org/v2/fedora/blobs/sha256:7a05f01240abe225dc6c0178dd0fa67874478dadaba59f33efde33bcfb242d93");
var layerStream = await message.Content.ReadAsStreamAsync();

using var layerReader = new TarReader(layerStream);
while (true)
{
    var layerEntry = layerReader.GetNextEntry();
    if (layerEntry is null)
    {
        return;
    }
}

@am11
Member

am11 commented Nov 11, 2022

> FWIW, I've run into a real-world scenario of a tarball that can't be read with .NET runtime 7.0.0. I'm attempting to read the tarball for a Fedora container image layer. It fails with this callstack:

We had the following patch in #74358, which was matching BSD & GNU tar(1) as well as libarchive's behavior: ignore non-octal bytes when reading the attributes

--- a/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
+++ b/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
@@ -221,6 +221,11 @@ internal static TarEntryType GetCorrectTypeFlagForFormat(TarEntryFormat format,
             buffer = TrimEndingNullsAndSpaces(buffer);
             buffer = TrimLeadingNullsAndSpaces(buffer);
 
+            // skip leading non-octal bytes
+            int offset = 0;
+            for (; offset < buffer.Length && (buffer[offset] < (byte)'0' || buffer[offset] > (byte)'7'); ++offset);
+            buffer = buffer.Slice(offset);
+
             if (buffer.Length == 0)
             {
                 return T.Zero;

It was rejected because it seemed too permissive going by the standard, which suggests ignoring only bytes 0 (NUL) and 32 (space) from the ASCII table; the Fedora image has ACK (ASCII 06) at index 0.

@adamsitnik
Member

> I've run into a real-world scenario of a tarball that can't be read with .NET runtime 7.0.0

> it was rejected because it seemed too permissive

@carlossanlop @jozkee @jeffhandley should we reconsider our approach?

@bartonjs
Member

Not without understanding why the Fedora image has an ACK there. Otherwise, the correct thing to patch would be whatever generated that tar, not .NET.

@mthalman
Member

@omajid may be of assistance here in tracking down what tool is used to produce the tarball. I saw the same issue with CentOS, which is in the same distro family as Fedora.

@omajid - see my post here: #74316 (comment)

@omajid
Member

omajid commented Nov 15, 2022

From what I can tell, what Fedora does goes like this:

  • Create a VM
  • Install packages into it
  • Get the VM's disk image
  • Mount disk image locally, and then run tar cf on it
  • Write docker metadata to tarball using python's tarfile.TarFile class
  • Recompress the tarball using xz directly
  • Save the image and then later push it to dockerhub

Here's a build log that shows all the steps except the last one: https://kojipkgs.fedoraproject.org//packages/Fedora-Container-Base/37/20221110.n.0/data/logs/image/oz-x86_64.log

The underlying tool seems to be https://github.com/redhat-imaging/imagefactory. I can't see much more because the website is dead: http://imgfac.org/

I am not terribly familiar with this part of Fedora, so I might have some details wrong.

@am11
Member

am11 commented Nov 16, 2022

> Otherwise, the correct thing to patch would be whatever generated that tar, not .NET.

This doc suggests that there is also a constraint on the length of octal fields (six, seven, or eleven digits, depending on the format, on whether the implementation predates or follows the POSIX 1988 standard, and on the platform). We don't check the length at all, so we are already lenient. Following what libarchive does for octal fields (find the first octal digit in the header, then read until the first non-octal digit or the end of the buffer) would not set a new precedent. Our TarWriter produces clean headers (octal digits followed by a null), so the reader can be permissive.
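
The lenient rule described here (skip to the first octal digit, then read until the first non-octal byte or the end of the buffer) can be sketched as follows. This is an illustration of the rule as stated in this comment, with made-up names; it is neither libarchive's code nor the TarHelpers.ParseOctal implementation, and it does not handle other numeric encodings.

```csharp
using System;

public static class LenientOctal
{
    // Skips bytes until the first octal digit, then accumulates digits until
    // the first non-octal byte or the end of the field. Returns 0 if no digit is found.
    public static long Parse(ReadOnlySpan<byte> field)
    {
        int i = 0;
        while (i < field.Length && !IsOctalDigit(field[i]))
        {
            i++;
        }

        long value = 0;
        for (; i < field.Length && IsOctalDigit(field[i]); i++)
        {
            // checked: an oversized field throws instead of silently wrapping.
            value = checked(value * 8 + (field[i] - (byte)'0'));
        }
        return value;
    }

    private static bool IsOctalDigit(byte b) => b >= (byte)'0' && b <= (byte)'7';
}
```

With this rule, a field that starts with an ACK byte (0x06) followed by the ASCII digits 000644 and a NUL parses as octal 644.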

@AngelMunoz

AngelMunoz commented May 19, 2023

Hey @mthalman I just ran into the unable to parse number issue as well using .NET's http client

Apparently the stream is being gzipped twice; I got a clue from here: icsharpcode/SharpZipLib#514

In my case I was doing a similar thing:

task {
  use! source = http.GetStreamAsync url
  // System.IO.InvalidDataException: Unable To Parse Number
  TarFile.ExtractToDirectory(source, output, false)
}

Wrapping the stream from the http client made it work

task {
  use! source = http.GetStreamAsync url
  // It works
  use outer = new GZipStream(source, CompressionMode.Decompress)
  TarFile.ExtractToDirectory(outer, output, false)
}

It might be worth checking if it is a similar issue :)

Edit:

I'm not sure if this was specific to .NET's HttpClient or a double gzip on node's side; I just know wrapping it made it work 😅

@danmoseley
Member Author

If that's common, I could imagine Tar giving a hint in the error, based on the Gzip magic number.
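
A best-effort check along those lines could look like the sketch below. The type and method names are hypothetical, and this is not how TarReader behaves today; it only compares the first bytes against the well-known signatures for gzip, xz, bzip2, and zstd.

```csharp
#nullable enable
using System;

public static class CompressionSniffer
{
    private static ReadOnlySpan<byte> GzipMagic => new byte[] { 0x1F, 0x8B };
    private static ReadOnlySpan<byte> XzMagic => new byte[] { 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00 };
    private static ReadOnlySpan<byte> Bzip2Magic => new byte[] { 0x42, 0x5A, 0x68 };
    private static ReadOnlySpan<byte> ZstdMagic => new byte[] { 0x28, 0xB5, 0x2F, 0xFD };

    // Returns a short format name if the buffer starts with a known
    // compression signature, or null if nothing matches.
    public static string? Detect(ReadOnlySpan<byte> header)
    {
        if (header.StartsWith(GzipMagic)) return "gzip";
        if (header.StartsWith(XzMagic)) return "xz";
        if (header.StartsWith(Bzip2Magic)) return "bzip2";
        if (header.StartsWith(ZstdMagic)) return "zstd";
        return null;
    }
}
```

An error path could call Detect on the first block it failed to parse and, when the result is not null, append a hint such as "the stream appears to be gzip-compressed" to the exception message.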

@NickeManarin

Another real-life example of the "Unable to parse number" exception:

https://github.com/ImageOptim/gifski/releases/download/1.11.0/gifski-1.11.0.tar.xz

   at System.Formats.Tar.TarHelpers.ThrowInvalidNumber()
   at System.Formats.Tar.TarHelpers.ParseOctal[T](ReadOnlySpan`1 buffer)
   at System.Formats.Tar.TarHeader.TryReadCommonAttributes(Span`1 buffer, TarEntryFormat initialFormat)
   at System.Formats.Tar.TarHeader.TryReadAttributes(TarEntryFormat initialFormat, Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData, TarEntryFormat initialFormat, Boolean processDataBlock)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at System.Formats.Tar.TarFile.ExtractToDirectoryInternal(Stream source, String destinationDirectoryPath, Boolean overwriteFiles, Boolean leaveOpen)

Wrapping in a GZipStream doesn't help.

@carlossanlop
Member

carlossanlop commented Jul 16, 2023

@NickeManarin the *.xz extension means the archive was compressed using the LZMA algorithm: https://www.tutorialspoint.com/using-xz-compression-in-linux#:~:text=The%20xz%20compression%20algorithm%20works,block%20independently%20using%20LZMA%20algorithm.

We don't yet support the LZMA algorithm in System.IO.Compression, but we do have an issue tracking the request to eventually add it: #1542

A very easy workaround is to import SharpCompress to read the LZMA part of your archive, and then pass it to the System.Formats.Tar.TarFile stream-based extraction method. I tested it and it works:

using SharpCompress.Compressors.Xz;
using System.Formats.Tar;
using System.IO;

class CSharpTestClass
{
    static void Main()
    {
        string tarXzArchivePath = @"D:\Downloads\gifski-1.11.0.tar.xz";
        string destinationDirectoryPath = @"D:\Downloads\extractedxz";

        if (!Directory.Exists(destinationDirectoryPath))
        {
            Directory.CreateDirectory(destinationDirectoryPath);
        }

        using FileStream file = File.Open(tarXzArchivePath, FileMode.Open);
        using XZStream xzStream = new(file);
        TarFile.ExtractToDirectory(xzStream, destinationDirectoryPath, overwriteFiles: false);
    }
}

Hope that helps!

@NickeManarin

@carlossanlop Thanks, that worked!

@danmoseley
Member Author

danmoseley commented Jul 17, 2023

Would we accept a change that, on failure to decompress, included the compression format in the message (by looking at magic numbers, presumably)? That might help in cases like this. It could just be best effort.
Hmm, maybe we already decided we didn't want to, I can't remember. But IIRC others have had this kind of confusion.

@carlossanlop
Member

I agree, we could add logic to TarReader to detect a compressed archive by reading the magic numbers. I opened issue #89056 to track that request specifically. I would like people looking for the error "Unable to parse number" to get directed there.

This issue can be closed since it was tracking a different problem (missing edge cases that I already addressed).

@ghost locked as resolved and limited conversation to collaborators Aug 17, 2023