Directories in ingested zips are errantly treated as files #15

Open
ajnelson-nist opened this issue Sep 23, 2019 · 0 comments

I wrote a test to confirm that files in an aff4 archive created with pyaff4 match what I expect them to be, by using aff4.py --extract-all. Unfortunately, dumping the files fails because a directory from my input is treated like a file. The issue appears to affect all directories.

The failing path starts from an aff4 archive created from scratch by ingesting a zip. (In particular, this is a zipped LoC Bag, though I don't think that has an impact apart from an internal path name not entirely relevant to this bug.) Reproduction instructions are included.

Suspected diagnosis

Every member of a zip, whether a file or directory, appears to be assigned the type aff4:FileImage per the --meta dump from the .aff4 file. I'm guessing in-zip directories should instead be aff4:FolderImage, as this query is being used to feed a loop:

for imageUrn in resolver.QueryPredicateObject(volume.urn, lexicon.AFF4_TYPE, lexicon.standard11.FileImage)

In that loop, every FileImage is created and treated as a regular file. A directory thrown into the mix raises an IsADirectoryError.

Suspected correction

In the function BasicZipFile.parse_cd, somewhere before the info message on line 694, a check needs to be made for whether the member is a directory. Checking whether the last character of the name is "/" should do; this is the same test that Python's zipfile.ZipInfo.is_dir() has applied since Python 3.6.

However, I don't know the code well enough to suggest where that information should be integrated (aside from a check soon after fn is defined in that function) and propagated so that the member is typed aff4:FolderImage. Perhaps the ZipInfo class in that file?
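As a minimal sketch of the trailing-slash check described above, the following uses only Python's standard zipfile module, not pyaff4's own zip parser. The helper names (is_zip_directory, classify_members, build_sample_zip) are hypothetical, and the "FolderImage"/"FileImage" strings are plain labels for illustration, not pyaff4 lexicon objects.

```python
import io
import zipfile


def is_zip_directory(member_name: str) -> bool:
    """Hypothetical helper: a zip member whose name ends in '/' is a directory."""
    return member_name.endswith("/")


def classify_members(zip_bytes: bytes) -> dict:
    """Map each member name to an illustrative type label."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # zipfile.ZipInfo.is_dir() (Python 3.6+) applies the same test.
            if is_zip_directory(info.filename):
                result[info.filename] = "FolderImage"
            else:
                result[info.filename] = "FileImage"
    return result


def build_sample_zip() -> bytes:
    """Build an in-memory zip shaped like deep.zip from the steps below."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("file3.txt", "file 3\n")
        zf.writestr("input_dir_1/", "")  # explicit directory entry
        zf.writestr("input_dir_1/file4.txt", "file 4\n")
    return buf.getvalue()
```

Where such a check would sit inside BasicZipFile.parse_cd, and how the result would reach the RDF type, is left open here just as it is in the issue text.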

Steps to reproduce

The code segments below work when run as individual shell scripts, confirmed on an Ubuntu 18.04 system.

  1. Create a zip with some directory in it.
#!/bin/bash

# step1.sh

rm -rf deep flat
mkdir -p flat
mkdir -p deep/input_dir_1

echo 'file 1' > flat/file1.txt
echo 'file 2' > flat/file2.txt
pushd flat
  zip -r ../flat.zip .
popd
rm -r flat

echo 'file 3' > deep/file3.txt
echo 'file 4' > deep/input_dir_1/file4.txt
pushd deep
  zip -r ../deep.zip .
popd
rm -r deep
  2. Ingest the zips into their respective aff4 archives.
#!/bin/bash

# step2.sh

# (First loading venv, fixing path to aff4.py ...)

python .../aff4.py \
  --hash \
  --ingest \
  --paranoid \
  --recursive \
  flat.aff4 \
  flat.zip

python .../aff4.py \
  --hash \
  --ingest \
  --paranoid \
  --recursive \
  deep.aff4 \
  deep.zip
  3. Extract everything from the flat aff4 archive. Currently works.

Pull Request 14 fixes an unrelated issue with the way extractAll is called and, as a matter of convenience, updates Pull Request 13: I also found some of @gonmator's fixes while fixing this call.

#!/bin/bash

# step3.sh

# (First loading venv, fixing path to aff4.py ...)

rm -rf extraction_flat
mkdir extraction_flat

# Note that the last argument here will not be necessary if PR 16 is incorporated.
python .../aff4.py \
  --extract-all \
  --folder extraction_flat \
  flat.aff4 \
  extraction_flat
  4. Extract everything from the deep aff4 archive. Currently fails.

PR 14 should be integrated in order to see step4.sh below fail in the illustrative way.

#!/bin/bash

# step4.sh

# (First loading venv, fixing path to aff4.py ...)

rm -rf extraction_deep
mkdir extraction_deep

# Note that the last argument here will not be necessary if PR 16 is incorporated.
python .../aff4.py \
  --extract-all \
  --folder extraction_deep \
  deep.aff4 \
  extraction_deep

Traceback of step4.sh:

Traceback (most recent call last):
  File "../deps/pyaff4/aff4.py", line 421, in <module>
    main(sys.argv)
  File "../deps/pyaff4/aff4.py", line 414, in main
    extractAll(dest, args.folder)
  File "../deps/pyaff4/aff4.py", line 312, in extractAll
    with open(destFile, "wb") as destStream:
IsADirectoryError: [Errno 21] Is a directory: 'extraction_deep/deep.zip/input_dir_1'
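
The traceback shows open(destFile, "wb") being called on a path that is a directory. A minimal sketch of an extraction step that avoids this, assuming the caller already knows whether a member is a directory, could look like the following. The function write_member and its parameters are hypothetical and are not part of pyaff4; this is not the body of extractAll.

```python
import os


def write_member(dest_root: str, member_path: str, data: bytes,
                 is_directory: bool) -> str:
    """Hypothetical helper: materialize one archive member under dest_root."""
    dest = os.path.join(dest_root, member_path)
    if is_directory:
        # Create the directory instead of opening it as a file.
        os.makedirs(dest, exist_ok=True)
        return dest
    # Make sure the parent directory exists before writing the file.
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(dest, "wb") as dest_stream:
        dest_stream.write(data)
    return dest
```

Creating parent directories on demand also means a file member extracts correctly even if its directory entry is processed later or is missing from the archive.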

Resolution confirmation

When step4.sh above creates this file hierarchy, this Issue's good to close.

$ find extraction_deep
extraction_deep
extraction_deep/deep.zip
extraction_deep/deep.zip/file3.txt
extraction_deep/deep.zip/input_dir_1
extraction_deep/deep.zip/input_dir_1/file4.txt
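
The find listing above could also be checked from a test. The following is a minimal sketch, assuming the extraction has already been written to a directory such as extraction_deep; the names EXPECTED_DEEP, relative_tree, and extraction_matches are hypothetical.

```python
import os

# Expected paths relative to the extraction root, taken from the listing above.
EXPECTED_DEEP = {
    "deep.zip",
    "deep.zip/file3.txt",
    "deep.zip/input_dir_1",
    "deep.zip/input_dir_1/file4.txt",
}


def relative_tree(root: str) -> set:
    """Collect every file and directory under root as '/'-separated relative paths."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.add(rel.replace(os.sep, "/"))
    return found


def extraction_matches(root: str, expected: set = EXPECTED_DEEP) -> bool:
    """Return True when the extracted hierarchy is exactly the expected one."""
    return relative_tree(root) == expected
```

A test would call extraction_matches("extraction_deep") after running step4.sh and additionally confirm that deep.zip/input_dir_1 is a directory rather than a regular file.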
blschatz pushed a commit that referenced this issue Oct 21, 2019
* Added /build/ directory to .gitignore

* Fixed bug initialising aff4.LogicalImage instances

* fixed bugs in extract()

* fixed aff4.extractAll() NameError exception

* Fix call to extractAll function

This patch is partially necessary to correct the `--extract-all` flag.
With this correction, `--extract-all` will work if there were no
directories ingested from an input zip.

Issue 15 reports on the problem with directories from an ingested zip.
#15

This patch builds on Pull Request 13, as I'd also found the binary-
output mode was necessary, though in a different spot.
#13

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>