Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

duplicate file in tar archive causes read to fail #1400

Open
deitch opened this issue Dec 13, 2022 · 9 comments · Fixed by #1445
Open

duplicate file in tar archive causes read to fail #1400

deitch opened this issue Dec 13, 2022 · 9 comments · Fixed by #1445
Assignees
Labels
bug Something isn't working

Comments

@deitch
Copy link
Contributor

deitch commented Dec 13, 2022

Please provide a set of steps on how to reproduce the issue

  1. Create a simple directory with just a few files. e.g.
$ mkdir /tmp/syft
$ cp $(which syft) /tmp/syft
$ echo foo > /tmp/syft/abc
  1. Create a tar file with those contents:
$ tar -C /tmp/syft -cvf /tmp/syft.tar
  1. Use syft to scan the dir, it all works:
$ syft dir:/tmp/syft # or just `syft /tmp/syft`
  1. Use syft to scan the tar file, it all works:
$ syft file:/tmp/syft.tar
  1. Add a duplicate file to the tar file
$ tar -C /tmp/syft -rvf /tmp/syft.tar ./abc
  1. Scan the tar file, it fails
$ syft file:/tmp/syft.tar
 ✔ Indexed /tmp/syft.tar
 ✔ Cataloged packages      [0 packages]

[0000]  WARN file could not be unarchived: reading file in tar archive: file already exists: /tmp/syft-archive-contents-1809756098/abc
No packages discovered

What happened:

syft refuses to scan a tar file when there are duplicate entries. Because of tar's sequential structure (it is a tape archive 😁 ), this is legitimate. Further, when untarring a tar file, tar generally extracts later files over previous ones, unless explicitly set to fail on duplicate. However, syft fails outright.

What you expected to happen:

I expected it to process the tar file successfully. At the least, there should be options to fail-on-duplicate or continue-on-duplicate (the default tar behaviour)

Anything else we need to know?:

No.

Environment:

  • Output of syft version:
Application:        syft
Version:            0.59.0
JsonSchemaVersion:  4.1.0
BuildDate:          2022-10-17T16:13:44Z
GitCommit:          41bc6bb410352845f22766e27dd48ba93aa825a4
GitDescription:     v0.59.0
Platform:           linux/amd64
GoVersion:          go1.18.7
Compiler:           gc

(although I tried it with v0.63.0 via docker as well)

  • OS (e.g: cat /etc/os-release or similar):
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

(although I ran it inside the official syft docker images as well)

@deitch deitch added the bug Something isn't working label Dec 13, 2022
@kzantow
Copy link
Contributor

kzantow commented Dec 13, 2022

Interesting problem @deitch! The one issue I see is that Syft operates on .tar files by extracting them to a temporary directory. If this happens with duplicate files, at least one of the files will be lost and Syft won't be able to properly scan the entire archive. Your suggestion of an option for something like --continue-on-duplicate seems appropriate, or what would you think of something like --duplicate-tar-files first|last|<number> to be able to specify behavior?

@deitch
Copy link
Contributor Author

deitch commented Dec 13, 2022

If this happens with duplicate files, at least one of the files will be lost and Syft won't be able to properly scan the entire archive.

That is true. And since both the first and second copies of abc are in the archive, if syft only gets the latter, it would not be a complete scan.

I think that is better than the current "will not scan at all" behaviour.

Perhaps the best would be if it could just read the entire tar archive as a stream, rather than extracting it, but I understand that there may be issues with that, like following links? I don't really know.

Your suggestion of an option for something like --continue-on-duplicate seems appropriate, or what would you think of something like --duplicate-tar-files first|last| to be able to specify behavior?

My instinct would be to give a WARN on duplicates but keep going, thus scanning the latest, with an option to fail-on-duplicates.

what would you think of something like --duplicate-tar-files first|last| to be able to specify behavior

I am not sure it fits the expected behaviour. tar behaviour is "last entry wins", which fits fine with syft taking the last one. Recognizing that it might be an incomplete sbom, because it misses others that were masked, a user might want to say, "hold on, if my sbom is incomplete, error out." I have a hard time seeing why someone might want to specifically pick one of the others. That would digress from normal tar behaviour (take last or optionally error out) without solving the "incomplete sbom" issue.

I think the user would want either to take last consistent with tar (default); or error out because of incomplete sbom (option). Anything else doesn't fit with either.

@kzantow
Copy link
Contributor

kzantow commented Dec 13, 2022

@deitch good point -- if a user extracts the tar, and the behavior is always that the last entry wins, I'd agree a warning here is probably sufficient as the first duplicate entry would essentially get overwritten. I'll bring this up wit the team today and see if we can get some consensus -- if so, this sounds like a pretty simple change.

@kzantow kzantow added this to OSS Dec 14, 2022
@kzantow kzantow moved this to Backlog (Pulled Forward for Priority) in OSS Dec 14, 2022
@deitch
Copy link
Contributor Author

deitch commented Dec 19, 2022

Is there anything I can do to help?

@kzantow kzantow self-assigned this Dec 21, 2022
@kzantow kzantow moved this from Backlog (Pulled Forward for Priority) to In Progress (Actively Resolving) in OSS Dec 21, 2022
@kzantow kzantow moved this from In Progress (Actively Resolving) to In Review in OSS Jan 10, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in OSS Jan 10, 2023
@deitch
Copy link
Contributor Author

deitch commented May 11, 2023

It looks like this is mostly fixed but not entirely.

If the file being replaced is a symlink, then it tries to follow the symlink and replace its target, rather than the link itself. This is an issue in archiver, not in syft per se, so I will open an issue there and link it here.

@deitch
Copy link
Contributor Author

deitch commented May 11, 2023

See mholt/archiver#380

@deitch
Copy link
Contributor Author

deitch commented May 11, 2023

See the linked issue. syft still uses archiver/v3, which no longer is supported. v4 doesn't have this issue, but it requires more work on the consumer's part.

@kzantow kzantow reopened this May 11, 2023
@kzantow
Copy link
Contributor

kzantow commented May 11, 2023

@deitch reopened this to update to archiver/v4 👍

@kzantow kzantow moved this from Done to Backlog in OSS May 12, 2023
@deitch
Copy link
Contributor Author

deitch commented Jul 12, 2024

Hi @kzantow following up on this one. Any success?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

2 participants