Remove unnecessary tests from a magic database to speed up MIME type detection with libmagic
Sometimes, you might want to know exactly what the type of a file is. This might be
particularly relevant in a security context where you cannot rely solely on file
extensions. One way to do that is to use
libmagic (and its command
file) and try to detect
the MIME type of the file. To achieve
this task, libmagic relies on magic files,
often called magic.mgc
on linux distributions, containing heuristic tests.
Going through the tests to determine the MIME type of a single file is already a long and resource-intensive process. Now imagine you want to block certain file types on the fly, before they can harm the devices on your network. This requires repeating the process hundreds or even thousands of times within a short period. That is why we started this project: we wanted to speed up the search performed by libmagic when the goal is to identify only a subset of MIME types. To do so, we take the official magic files and remove every test that is not needed to detect the MIME types we are interested in.
See the Install section for more details on the requirements of our tools.
With our CLI tool, you can quickly create a minimal magic file called mini_magic
containing only the tests necessary to detect application/pdf and
application/x-executable from the magic files located in Magdir:
mini-magic --mime-types application/pdf,application/x-executable --src Magdir --magic-filename mini_magic
The minimal magic file, mini_magic, can either be used as is by libmagic:
# -i causes the file command to output mime type strings
file -m mini_magic some_file -i
or it can first be compiled to further improve performance:
file -C -m mini_magic
This produces the compiled magic database mini_magic.mgc.
For more details on the CLI capabilities and options, use the --help flag.
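To illustrate the on-the-fly filtering use case, here is a minimal sketch of how the compiled database might be used from a shell script. The blocklist, the MAGIC_DB variable, and the helper names detect_mime and is_blocked are assumptions made for this example and are not part of the project. The script relies only on the standard file options -b, --mime-type and -m.

```shell
#!/bin/sh
# Hypothetical blocklist: the two MIME types used in the examples above.
BLOCKED_TYPES="application/pdf application/x-executable"
# Path to the compiled database produced by `file -C -m mini_magic`
# (assumed to be in the current directory unless MAGIC_DB is set).
MAGIC_DB="${MAGIC_DB:-mini_magic.mgc}"

# Print the MIME type of the file given as first argument.
# -b omits the file name, --mime-type omits the charset part.
detect_mime() {
    file -b --mime-type -m "$MAGIC_DB" "$1"
}

# Succeed (exit status 0) if the given MIME type is on the blocklist.
is_blocked() {
    for t in $BLOCKED_TYPES; do
        if [ "$t" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Check every file passed on the command line.
for f in "$@"; do
    mime=$(detect_mime "$f")
    if is_blocked "$mime"; then
        echo "BLOCK $f ($mime)"
    else
        echo "ALLOW $f ($mime)"
    fi
done
```

Saved under a name of your choice, for example check_files.sh, it could be called as sh check_files.sh some_file another_file. Because the database only contains the tests for the selected MIME types, any other file is reported with whatever fallback type file assigns to it.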
We also provide a Perl module called MiniMagic. You can achieve the same
result as the CLI tool with the following code snippet:
use strict;
use warnings;
# import the function used below from the module
use MiniMagic qw/create_mini_magic_file/;
# list of MIME types you want to detect
my $mime_types = ["application/pdf", "application/x-executable"];
# path to the directory containing the magic files
my $src_dir = "Magdir";
# name of the mini magic file containing only the necessary tests
my $magic_name = "mini_magic";
create_mini_magic_file($mime_types, $src_dir, $magic_name);
See the API section for more details about the module.
Finally, for those who do not want to deal with the dependencies, we created a Docker image for this project. To build the image, run the following command from the root of the project:
# mini-magic is the name of the new docker image
docker build -t mini-magic .
Once the build is done, you can run the following command to create the same magic file as in the previous examples:
docker run -v "$(pwd):/data" mini-magic --mime-types application/pdf,application/x-executable --magic-filename /data/mini_magic
use MiniMagic qw/create_mini_magic_file download_magic_files list_mime_types print_list_mime_types/;
create_mini_magic_file creates a magic file called $magic_name containing
all tests needed to detect the MIME types listed in the (referenced) array
$mime_types. For this, it uses the definitions located at $src_dir.
download_magic_files downloads all the magic files compatible with
libmagic version $version from the official repository
and saves them to the directory $src_dir.
list_mime_types creates a (referenced) array containing all MIME types covered
by the magic files in the directory $src_dir.
print_list_mime_types prints all MIME types covered by the magic files in
the directory $src_dir.
All tests and benchmarks can be run as follows:
perl name_script.pl
The module MiniMagic (and the CLI tool mini-magic) requires Perl 5.28
to function properly. In addition, you might need to install the following
modules:
- Log::Any
- Log::Any::Adapter::Dispatch
- Log::Log4perl
- Log::Any::Adapter::Log4perl
- Const::Fast
- IPC::Run
- LWP::Simple
- Archive::Extract
- File::Slurper
- File::Copy::Recursive
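
Before installing anything, it can help to see which of these modules are already present. The following is a small sketch, not part of the project, that tries to load each module with perl's standard -M switch and reports the ones that fail to load. The variable names are chosen for this example only.

```shell
#!/bin/sh
# Modules listed in the requirements above.
modules="Log::Any Log::Any::Adapter::Dispatch Log::Log4perl Log::Any::Adapter::Log4perl Const::Fast IPC::Run LWP::Simple Archive::Extract File::Slurper File::Copy::Recursive"

missing=""
checked=0
for m in $modules; do
    checked=$((checked + 1))
    # `perl -MName -e 1` exits non-zero when the module cannot be loaded.
    if ! perl -M"$m" -e 1 >/dev/null 2>&1; then
        missing="$missing $m"
    fi
done

if [ -n "$missing" ]; then
    echo "Missing modules:$missing"
else
    echo "All $checked modules are available."
fi
```

Any module reported as missing can then be installed with the CPAN client of your choice.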