`FileType` guesstimating needs refactoring #257

jtmoon79 · 2024-03-22T21:42:16Z

Summary

File type estimation (guessing) is messy.

Current behavior

- The use of `Mimetype` adds nearly zero benefit for a lot of code.
- The resultant `MimeGuess` and `FileType` are confusing; which one matters when?
- `FileType` guessing is hacky name matching.

Suggested behavior

1. Remove `Mimetype` and `MimeGuess` entirely (affects Issue #15, "BlockReader should receive MimeGuess").
2. Take a more robust and systematic approach to determining `FileType` based on the file name.

If 1. and 2. are completed, then a new Issue should be created for allowing `filepreprocessor.rs` to read the zero block (first block) of the file and do some kind of magic fingerprint matching as well.
That change leads to another very large change: file preprocessing may return multiple `FileType`s, the appropriate Reader is attempted for the first one, and if it fails then the Reader for the next one is attempted.
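The fingerprint-then-fallback idea could be sketched roughly as follows. This is only an illustration, not code from the project: the `Candidate` enum and the functions `candidates_from_block` and `first_success` are hypothetical names. The gzip, xz, and EVTX signatures are the well-known leading bytes of those formats.

```rust
/// Hypothetical candidate file types; these names are illustrative and are
/// not the project's `FileType` variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Candidate {
    Gz,
    Xz,
    Evtx,
    Text,
}

/// Inspect the first block of a file and return candidate types,
/// most specific first. Plain text is always the last fallback.
pub fn candidates_from_block(block: &[u8]) -> Vec<Candidate> {
    let mut found = Vec::new();
    // gzip member header begins with 0x1f 0x8b
    if block.starts_with(&[0x1f, 0x8b]) {
        found.push(Candidate::Gz);
    }
    // xz stream header magic: 0xfd '7' 'z' 'X' 'Z' 0x00
    if block.starts_with(&[0xfd, b'7', b'z', b'X', b'Z', 0x00]) {
        found.push(Candidate::Xz);
    }
    // EVTX file header signature: "ElfFile" followed by a NUL byte
    if block.starts_with(b"ElfFile\0") {
        found.push(Candidate::Evtx);
    }
    // nothing matched, or a match may be a false positive: try text last
    found.push(Candidate::Text);
    found
}

/// Try a reader for each candidate in order; the first reader that
/// succeeds wins. `try_reader` stands in for constructing a real Reader.
pub fn first_success<T, E>(
    candidates: &[Candidate],
    mut try_reader: impl FnMut(Candidate) -> Result<T, E>,
) -> Option<(Candidate, T)> {
    for &candidate in candidates {
        if let Ok(reader) = try_reader(candidate) {
            return Some((candidate, reader));
        }
    }
    None
}
```

Returning an ordered `Vec` rather than a single value is what lets a failed Reader fall through to the next guess. A real implementation would also need to decide how much of the file a failed Reader may consume before giving up.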


jtmoon79 · 2024-04-25T07:49:39Z

> new Issue should be created around allowing the filepreprocessor.rs to read the zero block of the file and do some kind of magic fingerprint matching as well.

https://github.com/bojand/infer for determining file type
https://github.com/anemele/filetype.rs for determining file type
https://lib.rs/crates/file-format for determining file type
https://lib.rs/crates/tree_magic_mini for determining file type

For Issue #16

https://lib.rs/crates/encoding_rs for determining text encoding

refactor `enum FileType` to embed archive and storage information in field variant `archival_type`
Add variant `encoding_type` for `FileType::Text`
refactor `pathbuf_to_filetype` to be more straightforward and recursive
entirely remove `Mimeguess`
Issue #15 (completed)
This is part 1 of completing the following issues: Issue #257, Issue #285

Refactor `path_to_filetype` to allow filetype_archive (gz, xz) for parseable files EVTX, FixedStruct, journal.
Allow compressed `.tar` files.
Only allows a "single level" of archival type.
None of these are handled yet.
This is part 2 of: Issue #257, Issue #285
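To illustrate the shape of the refactoring described above, here is a minimal sketch of a `FileType` that carries an `archival_type` field, plus a name-based `path_to_filetype` that permits only a single level of compression. The variants, the suffix lists, and the helper `inner_type` are assumptions made for the example and are not the project's actual definitions. FixedStruct and `.tar` handling are left out.

```rust
use std::path::Path;

/// Hypothetical archival wrapper; only a single level is modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArchivalType {
    Normal,
    Gz,
    Xz,
}

/// Hypothetical text encodings; a real list would be longer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncodingType {
    Utf8,
}

/// Illustrative `FileType` with archival information embedded per variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Evtx { archival_type: ArchivalType },
    Journal { archival_type: ArchivalType },
    Text { archival_type: ArchivalType, encoding_type: EncodingType },
    Unknown,
}

/// Classify a lowercased file name that has had any archive suffix removed.
fn inner_type(lower_name: &str, archival_type: ArchivalType) -> FileType {
    if lower_name.ends_with(".evtx") {
        FileType::Evtx { archival_type }
    } else if lower_name.ends_with(".journal") {
        FileType::Journal { archival_type }
    } else if lower_name.ends_with(".log") || lower_name.ends_with(".txt") {
        FileType::Text { archival_type, encoding_type: EncodingType::Utf8 }
    } else {
        FileType::Unknown
    }
}

/// Determine a `FileType` from the file name alone.
/// At most one archive suffix is stripped ("single level").
pub fn path_to_filetype(path: &Path) -> FileType {
    let name = match path.file_name().and_then(|n| n.to_str()) {
        Some(n) => n,
        None => return FileType::Unknown,
    };
    let lower = name.to_ascii_lowercase();
    for &(suffix, archival_type) in &[(".gz", ArchivalType::Gz), (".xz", ArchivalType::Xz)] {
        if let Some(stem) = lower.strip_suffix(suffix) {
            return inner_type(stem, archival_type);
        }
    }
    inner_type(&lower, ArchivalType::Normal)
}
```

With this shape, `System.evtx.gz` maps to `FileType::Evtx` with `ArchivalType::Gz`, while a doubly compressed name such as `a.log.gz.xz` falls through to `FileType::Unknown` because only one archive suffix is stripped.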

jtmoon79 changed the title ~~file type estimating needs refactoring~~ FileType guesstimating needs refactoring Mar 22, 2024

jtmoon79 added code improvement enhancement not seen by the user P1 important labels Mar 22, 2024

jtmoon79 mentioned this issue Apr 6, 2024

Handle large number of files #270

Open

jtmoon79 added the difficult A difficult problem; a major coding effort or difficult algorithm to perfect label Apr 16, 2024

jtmoon79 mentioned this issue Apr 21, 2024

fully extract .journal, .evtx compressed/archived files to temporary files #284

Closed

jtmoon79 closed this as completed May 31, 2024
