WIP: Enable duplicate detection via bag manifests #118
Open: ross-spencer wants to merge 16 commits into master from dev/issue-448-add-duplicate-reporting-mechanism
This commit enables duplicate detection via bag manifests in the AIP store, by comparing each AIP to the other AIPs.
Co-Authored-By: peterVG <672121+peterVG@users.noreply.github.com>
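As a rough illustration of manifest-based duplicate detection, the sketch below indexes checksums from BagIt-style manifest text (one `<checksum> <filepath>` pair per line) and reports any checksum seen more than once across AIPs. The function names and the in-memory `manifests` mapping are assumptions made for this example, not code from this branch.

```python
from collections import defaultdict


def parse_manifest(text):
    """Parse BagIt-style manifest text into (checksum, filepath) pairs."""
    entries = []
    for line in text.splitlines():
        # Split on the first run of whitespace only, so paths may contain spaces.
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        checksum, filepath = parts
        entries.append((checksum.lower(), filepath.strip()))
    return entries


def find_duplicates(manifests):
    """Map each checksum seen more than once to its (aip, filepath) locations.

    `manifests` is a hypothetical dict of AIP name -> manifest text.
    """
    index = defaultdict(list)
    for aip, text in manifests.items():
        for checksum, filepath in parse_manifest(text):
            index[checksum].append((aip, filepath))
    return {c: locs for c, locs in index.items() if len(locs) > 1}
```

Normalising checksums to lowercase before indexing avoids missing matches between manifests that were written with different hex casing.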
Begin to pull parts of the code out that need to be more generic. In this commit we're starting to test other AIP compression types.
ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch 2 times, most recently from 25436fc to aaca9fa (July 3, 2019 15:18)
ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch 4 times, most recently from e3459b2 to aa8c4c3 (July 8, 2019 15:09)
This commit introduces an accruals->aips comparison capability. Digital objects in an accruals folder can now be compared to the contents of an AIP store.
Where filepaths, checksums, and dates all match, the object is considered to be identical (a true duplicate). Where they don't, users can use modulo (%) to identify where the object isn't in fact identical.
Much of the benefit of this work is derived from the nature of the AIP structure imposed on a digital transfer.
Once the comparison is complete, three reports are output in CSV format:
* True-duplicates.
* Near-duplicates (checksums match, but other components might not).
* Non-duplicates.
Additionally, a summary report is output in JSON.
ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch from aa8c4c3 to cab6f33 (July 8, 2019 15:11)
Compare an accruals location to an AIP store
This commit introduces an accruals->aips comparison capability.
Digital objects in an accruals folder can now be compared to the
contents of an AIP store.
Where filepaths and checksums and dates match, the object is
considered to be identical (a true duplicate). Where they don't,
users can use modulo (%) to identify where the object isn't in fact
identical.
Much of the benefit of this work is derived from the nature of the
AIP structure imposed on a digital transfer.
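Once the comparison is complete, three reports are output in CSV
format:
* True-duplicates.
* Near-duplicates (checksums match, but other components might not).
* Non-duplicates.
Additionally, a summary report is output in JSON.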
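The three report categories described above can be sketched as a small classification function. This is a minimal illustration under stated assumptions: the `Entry` tuple, the `classify` name, and the exact matching rules are hypothetical stand-ins, not the logic implemented in this branch.

```python
from collections import namedtuple

# Hypothetical record for one digital object; field names are assumptions.
Entry = namedtuple("Entry", ["filepath", "checksum", "date"])


def classify(accrual, aip_entries):
    """Return the report category for one accrual object."""
    same_checksum = [e for e in aip_entries if e.checksum == accrual.checksum]
    if not same_checksum:
        # No AIP object shares this checksum.
        return "non-duplicate"
    for entry in same_checksum:
        # Checksum, filepath, and date all agree with an object in the store.
        if entry.filepath == accrual.filepath and entry.date == accrual.date:
            return "true-duplicate"
    # The checksum matches, but the path or date differs.
    return "near-duplicate"
```

Checking the checksum first means the path and date comparison only runs against candidates whose content already matches.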
Connected to archivematica/Issues#448
Configuration
The API configuration and the transfer source location are set via this configuration file. Note that the "accruals_transfer_source" parameter identifies a transfer source in the Storage Service by its Description, here 'accruals'. It could equally be any other value more appropriate to your institution.
The primary script will also accept a value for this transfer source on the command line, e.g.
python3 -m duplicates.accruals <my_transfer_source_description>
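One way the command-line value could take precedence over the configuration file is sketched below. It assumes the configuration is JSON and that `accruals_transfer_source` is a top-level key; the function name and the fallback default are hypothetical, and the actual script may resolve the value differently.

```python
import json

# Assumed fallback, mirroring the 'accruals' Description mentioned above.
DEFAULT_SOURCE = "accruals"


def resolve_transfer_source(config_text, argv):
    """Pick the transfer source description, preferring the command line."""
    config = json.loads(config_text)
    if len(argv) > 1:
        # argv[1] is the optional <my_transfer_source_description> argument.
        return argv[1]
    return config.get("accruals_transfer_source", DEFAULT_SOURCE)
```

Passing `sys.argv` as `argv` would give the behaviour described: the configured value is used unless a description is supplied when the module is invoked.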
With everything configured correctly the successful output on the command line may look as follows:
The CSV files output as a result can then be used to compile a list of files specifically selected to be transferred into Archivematica.
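Compiling such a list might look like the sketch below, which reads one report's CSV text and returns the unique file paths it contains. The `filepath` column name is an assumption for illustration; the real reports may use different headers.

```python
import csv
import io


def files_to_transfer(csv_text, path_column="filepath"):
    """Return sorted, de-duplicated paths from one CSV report.

    `path_column` is a hypothetical header name.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # A set comprehension drops repeated rows for the same path.
    return sorted({row[path_column] for row in reader if row.get(path_column)})
```

Running this over the non-duplicates report, for example, would yield a candidate list of objects not yet present in the AIP store.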