A utility for staging files, calculating and validating file checksums, and comparing checksum values between storage locations.
- Python (version 3.6+)
- Pip (version 19.0+)
MPT works best within a virtual environment. To create a new virtual environment, start a command prompt and enter the following command:
python -m venv [path-to-venv-directory]
This will create a directory structure in [path-to-venv-directory]
containing all the necessary configuration and
data files required. The virtual environment can be activated by entering one of the following at the command prompt:
Windows:
[path-to-venv-directory]\Scripts\activate.bat
Linux:
source [path-to-venv-directory]/bin/activate
When you've activated the virtual environment, install MPT from a Git repository:
pip install git+http://github.com/britishlibrary/mpt
Or from a local source:
pip install /path/to/mpt-source/
All dependencies should be automatically downloaded and installed as part of pip's install process.
In order to automatically e-mail summary reports, MPT requires that three environment variables be set:
MAIL_SERVER = mail.example.com MAIL_SERVER_PORT = 587 MAIL_SENDER_ADDRESS = <the sender address you wish displayed in all e-mails>
An example of MAIL_SENDER_ADDRESS
might be Bitwise Checks <do_not_reply@example.com>
On Windows, these should be set via Control Panel > System > Advanced System Settings > Environment Variables.
On Linux, these should be added to the ~/.bash_profile
or ~/.profile
file for the user running MPT.
MPT has several modes of operation.
MPT can calculate checksums for an existing collection of files, and store those checksums in a 'checksum tree' which mimics the directory structure of the original files. Optionally it can also store these checksum values in a single manifest file.
mpt create dir -t TREE [-a ALGORITHM] [--formats FORMATS ] [-m MANIFEST] [-r]
The various command line options and arguments are described below.
The directory of files to process.
Use the -t
or --tree
option to specify the directory in which the 'checksum tree' should be created. A
checksum file will be created in the tree for each file checked. The name and path to the checksum file will mirror
that of the original file checked.
Use the -r
or --recursive
option to process all sub-folders beneath the given directory. By default only
the top-level directory will be processed.
Use the -a
or --algorithm
option to specify the checksum algorithm to use. A number of different algorithms
are supported (use mpt create -h
to list them all). The default algorithm is sha256
.
Use the --formats
option to limit checksum creation to files with a particular file extension.
Use the -m
or --manifest
option to specify a manifest file to be created in addition to the 'checksum tree'.
mpt create -r c:\storage\files -t c:\storage\checksums -m c:\storage\manifest.sha256 --formats tiff tif
This will create checksums for all files ending in tiff
or tif
in c:\storage\files
and all subdirectories. The SHA256
algorithm will be used as the default option. The resulting 'checksum tree' will be created in c:\storage\checksums
mirroring the original directory structure. A manifest file containing all checksums will also be created (if it does
not already exist) or updated at c:\storage\manifest.sha256
.
MPT can verify the checksums of all files listed in a 'checksum tree' created by the creation or staging mode.
mpt validate_tree dir -t TREE [-r]
The various command line options and arguments are described below.
The root directory of files to validate.
Use the -t
or --tree
option to specify the root directory of the 'checksum tree' used to validate the data
files.
Use the -r
or --recursive
option to process all sub-folders beneath the given directory. By default only
the top-level directory will be processed.
mpt validate_tree -r c:\storage\files -t c:\storage\checksums
This will validate all data files in c:\storage\files
and all subdirectories. Each file will be validated using its
checksum file in the 'checksum tree' in c:\storage\checksums
.
MPT can verify the checksums of all files listed in a manifest file created by the creation or staging mode.
mpt validate_manifest dir -m MANIFEST [-r] [-a ALGORITHM]
The various command line options and arguments are described below.
The root directory of files to validate.
Use the -m
or --manifest
option to specify the location of the manifest file used to validate the data
files.
Use the -a
or --algorithm
option to specify the checksum algorithm to use. A number of different algorithms
are supported (use mpt validate_manifest -h
to list them all). The default algorithm is sha256
.
mpt validate_manifest c:\storage\files -m c:\storage\manifest.sha256
This will validate all data files in c:\storage\files
and all subdirectories. Each file will be validated using its
entry in the manifest file c:\storage\manifest.sha256
.
MPT can compare the checksums stored in a 'checksum tree' to other 'trees' stored in different locations in order to detect any discrepancies.
mpt compare_trees dir -t OTHER_TREES
The various command line options and arguments are described below.
The root directory of the master checksum tree to use as a base of comparison.
Use the -t
or --trees
option to specify the location of other checksum trees to compare to the master.
mpt compare_trees c:\storage\checksums -t q:\backup_storage_1\checksums z:\backup_storage_2\checksums
This will compare all checksum files in the 'checksum tree' located in c:\storage\checksums
against the
corresponding files in q:\backup_storage_1\checksums
and z:\backup_storage_2\checksums
and highlight any
discrepancies.
MPT can compare the checksums stored in a manifest file to manifests in other locations in order to detect any discrepancies.
mpt compare_manifests manifest -m OTHER_MANIFESTS
The various command line options and arguments are described below.
The path to the master manifest file to use as a base of comparison.
Use the -m
or --other_manifests
option to specify the location of other manifests to compare to the master.
mpt compare_manifests c:\storage\manifest.sha256 -m q:\backup_storage_1\manifest.sha256 z:\backup_storage_2\manifest.sha256
This will compare all entries in the manifest file c:\storage\manifest.sha256
against the
corresponding files q:\backup_storage_1\manifest.sha256
and z:\backup_storage_2\manifest.sha256
and highlight
any discrepancies.
File staging involves processing all files in a particular directory and moving them to one or more storage locations, calculating their checksums in the process.
If staging is successful for all destinations then the original file will be removed from the staging area. If any part of the staging process fails for a particular file, then the entire staging process will be backed out for that file. This is to ensure that the staged file is present either in all destinations or in none.
For example, if a file is successfully copied to three out of four destinations, but fails on the fourth destination, the file will be removed from each of the three other nodes. The final summary report would describe the details of the error condition for the one destination which failed, while the other three would be listed as "Unstaged."
mpt stage dir -d DESTINATIONS [-a ALGORITHM] [-t TREES] [-m MANIFESTS ] [--max-failures MAX_FAILURES]
The various command line options and arguments are described below.
The directory of files to be staged.
Use the -d
or --destinations
option to specify the root directory of each staging destination (i.e. where the
files should be moved to). These destinations can be in any order, but the order must be consistent between this option
and the --trees
and --manifests
options if they are used.
If the --trees
option to specify 'checksum tree' locations is omitted, then the files will actually be staged to a
subdirectory named files
directly beneath each specified staging destination.
Use the -a
or --algorithm
option to specify the checksum algorithm to use. A number of different algorithms
are supported (use mpt stage -h
to list them all). The default algorithm is sha256
.
Use the -t
or --trees
option to specify the root directory of each destination checksum tree (i.e. where the
checksums should be stored in each staging destination).
If provided, then these destination tree paths must be listed in the same order as the staging destinations listed for
the --destinations
option - e.g. the first path listed for -t
must be for the checksum tree corresponding to
the first destination listed for the -d
option, and so on.
If this option is omitted altogether, then checksum trees will actually be created in a subdirectory named checksums
directly beneath each specified staging destination.
Use the -m
or --manifests
option to specify the location of a manifest file to create or update in each staging
destination.
If provided, then these manifest paths must be listed in the same order as the staging destinations listed for the
--destinations
option - e.g. the first manifest listed for -m
must be for the manifest corresponding to the
first destination listed for the -d
option, and so on.
If this option is omitted altogether, then no manifest files will be created.
By default, staging mode will prompt the user to confirm that all file paths are correct before commencing. Using the
--no-confirm
option will bypass this prompt. The intention is for the user to prepare and test their command-line
syntax interactively using the confirmation prompt as a guide, and use the --no-confirm
option when scheduling the
staging process to run automatically.
By default, staging will be aborted if 10 consecutive write failures occur. Use the --max-failures
option to
override this threshold.
By default, any empty folders left in the staging directory will be deleted once staging is complete. Using the
--keep-staging-folders
option will change this behaviour and leave empty folders untouched. This may be useful in
cases where a complex hierarchical structure needs to be maintained for new files and maintaining an empty file system
in the staging directory is easier than recreating the structure for each run.
Example 1 (use defaults for file & checksum destinations):
mpt stage f:\staging -d c:\storage q:\backup_storage_1 z:\backup_storage_2
This will process all files in f:\staging
and create output in the following locations
Destination | Files | Checksums | Manifest |
1 | c:\storage\files |
c:\storage\checksums |
None |
2 | q:\backup_storage_1\files |
q:\backup_storage_1\checksums |
None |
3 | z:\backup_storage_2\files |
z:\backup_storage_2\checksums |
None |
Example 2 (use specific checksum & manifest locations):
mpt stage f:\staging -d c:\storage\datastore q:\backup_storage_1\datastore z:\backup_storage_2\file_data -t c:\storage\checksumdata q:\backup_storage_1\checksumdata z:\backup_storage_2\meta_data\checksums -m c:\storage\manifest.sha256 q:\backup_storage_\manifest.sha256 z:\backup_storage_2\meta_data\manifest.sha256
This will process all files in f:\staging
and create output in the following locations:
Destination | Files | Checksums | Manifest |
1 | c:\storage\datastore |
c:\storage\checksumdata |
c:\storage\manifest.sha256 |
2 | q:\backup_storage_1\datastore |
q:\backup_storage_1\checksumdata |
q:\backup_storage_1\manifest.sha256 |
3 | z:\backup_storage_2\file_data |
z:\backup_storage_2\meta_data\checksums |
z:\backup_storage_2\meta_data\manifest.sha256 |
The following options can be used with all modes of operation. They should be used in the command line before the mode of operation (e.g. create, stage, etc) is specified.
Use the -p
or --num-processes
option to specify the number of concurrent processes MPT should use. The
default value is 2. The ideal number will depend on the number of CPUs and processor cores the host machine has.
Use the -e
or --email-results
option to specify e-mail recipients for MPT's summary reports.
Use the -o
or --output
option to specify the root directory used to store reports. Subdirectories will be
created beneath this directory for each type of report (creation, validation, comparison and staging), and a separate
dated directory will be created each time MPT runs.
Normally MPT will count the number of files to be processed before it starts. When run interactively, this can provide
a useful picture of its progress - however, this is at the cost of potentially taking a long time to begin processing,
as all files have to be counted before processing can begin. Use the --no-count
option to skip file counting
and simply display a count of how many files have been processed so far.
By default, the summary reports produced by MPT show each file's path relative to the root directory specified
on the command line. Use the --absolute-path
option to instead show an absolute path. Note that this may include
a drive letter (on Windows) or mount point (on Linux) which does not exist for all users.
MPT produces its output reports as it is running. By default, it caches 1000 records in memory before writing them to
disk. To override this setting, use the --cache-size
option to specify a different number of records. A higher
value will result in higher memory usage, whereas a lower number will cause more frequent writing to disk. Depending on
the number of files being processed by MPT, adjustments to the cache size may improve overall performance.
mpt --email-results recipient@example.com recipient2@example.com --num-processes 8 --no-count --cache-size 0 --output c:\storage\reports validate_tree c:\storage\files --tree c:\storage\checksums
This will validate the files stored in c:\storage\files
using the checksum tree in c:\storage\checksums
,
using 8 concurrent processes and without counting the files to be processed. Results will be written out to disk
immediately rather than being cached. The resulting reports will be written to the directory c:\storage\reports
and sent via e-mail to the two listed recipients.
This project is licensed under the Apache License 2.0. For details see the accompanying LICENSE file or visit:
http://www.apache.org/licenses/LICENSE-2.0
Copyright (c) 2020, The British Library