Feature-73: store_object Refactor (with References) #74

Merged
merged 71 commits into from
Dec 6, 2023
Changes from 63 commits
Commits
6e5bff1
Update 'store_object' method signature with default None values
doulikecookiedough Nov 9, 2023
efc116e
Add new private method '_store_data'
doulikecookiedough Nov 9, 2023
3831bac
Refactor 'store_object' to only store data when pid is 'None'
doulikecookiedough Nov 9, 2023
90c3239
Refactor '_move_and_get_checksums' to store objects with their conten…
doulikecookiedough Nov 10, 2023
f6a5cd1
Update tests after store_object refactor to store with object's conte…
doulikecookiedough Nov 10, 2023
5a69001
Update HashStore interface documentation
doulikecookiedough Nov 10, 2023
d18eba9
Add new public API methods to HashStore interface 'tag_object' and 'f…
doulikecookiedough Nov 10, 2023
78f84c3
Update HashStore initialization to create required 'refs' directory a…
doulikecookiedough Nov 10, 2023
0af3514
Add TODOs and pseudo code in 'FileHashStore'
doulikecookiedough Nov 10, 2023
6a32c45
Rename 'put_object' method to '_store_and_validate_data' and update t…
doulikecookiedough Nov 10, 2023
270e556
Add reference locks and skeleton code for 'tag_object'
doulikecookiedough Nov 10, 2023
d20f41e
Fill out 'tag_object' skeleton, update 'get_store_path' method for 'r…
doulikecookiedough Nov 10, 2023
73a2c66
Fix test for 'build_abs_path' in FileHashStore
doulikecookiedough Nov 10, 2023
f44871e
Add new fcntl import, code method 'write_cid_reference' and add new p…
doulikecookiedough Nov 10, 2023
4c01344
Add missing assertion statement to 'write_cid_reference' test for ver…
doulikecookiedough Nov 11, 2023
0178f28
Add new method 'update_cid_reference' with new pytests
doulikecookiedough Nov 11, 2023
21233af
Add new 'delete_cid_reference_pid' method with new pytests
doulikecookiedough Nov 11, 2023
1fc158f
Rename refs related method names and update pytests
doulikecookiedough Nov 11, 2023
b7833f0
Add new 'delete_cid_refs_file' method with new pytests
doulikecookiedough Nov 11, 2023
d4e8274
Add missing docstring for 'tag_object'
doulikecookiedough Nov 11, 2023
28d4671
Add new 'write_pid_refs_file' method with new pytests
doulikecookiedough Nov 11, 2023
5689c3c
Add new pytests for 'write_pid_refs_file' method
doulikecookiedough Nov 11, 2023
9f5cb60
Add new method 'delete_pid_refs_file' with new pytests
doulikecookiedough Nov 11, 2023
9c6509e
Update --run-slow pytests
doulikecookiedough Nov 11, 2023
2738d3f
Code 'find_object' method, missing pytests
doulikecookiedough Nov 11, 2023
2e03a6a
Fix bug in 'tag_object', refactor 'retrieve_object' method and update…
doulikecookiedough Nov 11, 2023
5b6a15f
Fix retrieve_object pytest in test_hashstore_client
doulikecookiedough Nov 11, 2023
e698c40
Refactor 'get_hex_digest' method and update pytests
doulikecookiedough Nov 11, 2023
81bb767
Initial refactor to 'delete_object' to find correct object to delete …
doulikecookiedough Nov 12, 2023
ccaa768
Add new method 'get_refs_abs_path' and refactor 'FileHashStore' and p…
doulikecookiedough Nov 12, 2023
8d44f42
Delete redundant 'get_sha256_hex_digest' method and refactor FileHash…
doulikecookiedough Nov 12, 2023
1a921a0
Refactor 'delete_object' to delete all required pid or cid reference …
doulikecookiedough Nov 12, 2023
4fee823
Synchronized 'delete_object' method with 'tag_object' method on cid v…
doulikecookiedough Nov 12, 2023
6316044
Add new pytests for 'delete_object'
doulikecookiedough Nov 12, 2023
e5b60ae
Add pytests for 'find_object' method
doulikecookiedough Nov 12, 2023
7242a62
Clean up 'filehashstore_interface' pytests
doulikecookiedough Nov 12, 2023
180d971
Add new pytests for 'tag_object' method
doulikecookiedough Nov 12, 2023
636eeff
Rename '_mktmpfile' method to '_write_to_tmp_file_and_get_hex_digests…
doulikecookiedough Nov 13, 2023
ee790eb
Extract new method '_mktmpfile' from '_write_to_tmp_file_and_get_hex_…
doulikecookiedough Nov 13, 2023
524947c
Refactor '_mktmpmetadata' method
doulikecookiedough Nov 13, 2023
d77a10e
Refactor 'write_pid_refs_file' to throw exception immediately if refs…
doulikecookiedough Nov 13, 2023
562cf79
Refactor 'tag_object' process and related methods and fix bug in 'upd…
doulikecookiedough Nov 13, 2023
59df239
Revise pytests, and extract pytests for references related processes …
doulikecookiedough Nov 13, 2023
ebfe610
Refactor 'tag_object' process to be atomic and clean up code
doulikecookiedough Nov 13, 2023
ea59f3e
Add new method '_validate_references' that is now called after atomic…
doulikecookiedough Nov 13, 2023
cef6e93
Refactor '_delete_cid_refs_file', revise pytests and add new pytests …
doulikecookiedough Nov 13, 2023
b567d5b
Add pytests for 'store_data_only'
doulikecookiedough Nov 14, 2023
348c536
Clean up comments, code and logging statements
doulikecookiedough Nov 14, 2023
3766155
Refactor '_validate_object' method and update docstrings
doulikecookiedough Nov 14, 2023
3b5275b
Add new method 'verify_object' to allow caller to validate an object'…
doulikecookiedough Nov 14, 2023
f8d1425
Clean up code and add TODO items
doulikecookiedough Nov 14, 2023
68ba2a7
Clean-up test modules' comments and docstrings and move 'tag_object' …
doulikecookiedough Nov 16, 2023
511e3e6
Revise '_update_cid_refs' and add new pytest to throw exception if fi…
doulikecookiedough Nov 16, 2023
2c4a1bd
Rename '_validate_file_size' to '_is_int_and_non_negative' for accuracy
doulikecookiedough Nov 16, 2023
38a1a3c
Update and add new pytests for '_write_cid_refs_file' method
doulikecookiedough Nov 16, 2023
a9cd611
Move info logging statements in finally blocks into try block
doulikecookiedough Nov 17, 2023
19695f0
Fix bug in 'tag_object', add new pytest and revise logging statements
doulikecookiedough Nov 17, 2023
5c9d22f
Update HashStore interface docstring for 'store_object'
doulikecookiedough Nov 17, 2023
b8d9715
Initial refactor to 'store_object', fixed bug in 'verify_object' and …
doulikecookiedough Nov 17, 2023
03f2b44
Clean up code to improve clarity
doulikecookiedough Nov 20, 2023
8536f1e
Clean up code, review tests and fix minor bugs and revise docstrings …
doulikecookiedough Nov 20, 2023
f993fb9
Add pytests for 'verify_object'
doulikecookiedough Nov 20, 2023
687df49
Refactor 'store_object' to also tag object when a pid is supplied and…
doulikecookiedough Nov 20, 2023
d8fe862
Clean up 'filehashstore' class for doc strings, typos and syntax form…
doulikecookiedough Dec 6, 2023
caa9d7b
Remove redundant method '_validate_arg_metadata' and refactor 'store_…
doulikecookiedough Dec 6, 2023
5cde868
Refactor '_is_string_none_or_empty' to call .strip() instead of .repl…
doulikecookiedough Dec 6, 2023
9db5a26
Rename method '_is_string_none_or_empty' to '_validate_string' for ac…
doulikecookiedough Dec 6, 2023
da78588
Remove redundant instance check in '_is_int_and_non_negative' method
doulikecookiedough Dec 6, 2023
2580808
Revise logging message accuracy in '_validate_arg_algorithms_and_chec…
doulikecookiedough Dec 6, 2023
7340747
Add 'verify_object' abstract method to 'HashStore' interface
doulikecookiedough Dec 6, 2023
f9a96d7
Clean up code
doulikecookiedough Dec 6, 2023
1,062 changes: 825 additions & 237 deletions src/hashstore/filehashstore.py
Large diffs are not rendered by default.

81 changes: 56 additions & 25 deletions src/hashstore/hashstore.py
@@ -5,9 +5,8 @@


class HashStore(ABC):
- """HashStore is a content-addressable file management system that
- utilizes a persistent identifier (PID) in the form of a hex digest
- value to address files."""
+ """HashStore is a content-addressable file management system that utilizes
+ an object's content identifier (hex digest/checksum) to address files."""

@staticmethod
def version():
@@ -26,28 +25,32 @@ def store_object(
expected_object_size,
):
"""The `store_object` method is responsible for the atomic storage of objects to
- disk using a given InputStream and a persistent identifier (pid). Upon
- successful storage, the method returns an ObjectMetadata object containing
- relevant file information, such as the file's id (which can be used to locate the
- object on disk), the file's size, and a hex digest map of algorithms and checksums.
- `store_object` also ensures that an object is stored only once by synchronizing
- multiple calls and rejecting calls to store duplicate objects.
-
- The file's id is determined by calculating the SHA-256 hex digest of the
- provided pid, which is also used as the permanent address of the file. The
- file's identifier is then sharded using a depth of 3 and width of 2,
+ disk using a given stream. Upon successful storage, the method returns an ObjectMetadata
+ object containing relevant file information, such as the file's id (which can be
+ used to locate the object on disk), the file's size, and a hex digest dict of algorithms
+ and checksums. Storing an object with `store_object` also tags an object (creating
+ references) which allow the object to be discoverable.
+
+ `store_object` also ensures that an object is stored only once by synchronizing multiple
+ calls and rejecting calls to store duplicate objects. Note, calling `store_object` without
+ a pid is a possibility, but should only store the object without tagging the object.
+ It is then the caller's responsibility to finalize the process by calling `tag_object`
+ after verifying the correct object is stored.
+
+ The file's id is determined by calculating the object's content identifier based on
+ the store's default algorithm, which is also used as the permanent address of the file.
+ The file's identifier is then sharded using the store's configured depth and width,
delimited by '/' and concatenated to produce the final permanent address
and is stored in the `/store_directory/objects/` directory.

By default, the hex digest map includes the following hash algorithms:
- Default algorithms and hex digests to return: md5, sha1, sha256, sha384, sha512,
- which are the most commonly used algorithms in dataset submissions to DataONE
- and the Arctic Data Center. If an additional algorithm is provided, the
- `store_object` method checks if it is supported and adds it to the map along
- with its corresponding hex digest. An algorithm is considered "supported" if it
- is recognized as a valid hash algorithm in the `hashlib` library.
-
- Similarly, if a file size and/or checksum & checksumAlgorithm value are provided,
+ md5, sha1, sha256, sha384, sha512 - which are the most commonly used algorithms in
+ dataset submissions to DataONE and the Arctic Data Center. If an additional algorithm
+ is provided, the `store_object` method checks if it is supported and adds it to the
+ hex digests dict along with its corresponding hex digest. An algorithm is considered
+ "supported" if it is recognized as a valid hash algorithm in the `hashlib` library.
+
+ Similarly, if a file size and/or checksum & checksum_algorithm value are provided,
`store_object` validates the object to ensure it matches the given arguments
before moving the file to its permanent address.
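
Reviewer note: the addressing scheme this docstring describes can be sketched as below. This is a minimal illustration and not the `FileHashStore` code. The helper names `hex_digests`, `shard` and `content_address` are hypothetical, and the SHA-256 default with a depth of 3 and width of 2 are assumptions borrowed from the values named in the removed docstring lines; the actual store reads them from its configuration.

```python
import hashlib
import io

# Default algorithm list as named in the docstring.
DEFAULT_ALGORITHMS = ("md5", "sha1", "sha256", "sha384", "sha512")


def hex_digests(stream, algorithms=DEFAULT_ALGORITHMS, chunk_size=8192):
    """Read the stream once and return {algorithm: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def shard(hex_digest, depth=3, width=2):
    """Split a digest into `depth` tokens of `width` chars plus the remainder."""
    tokens = [hex_digest[i * width:(i + 1) * width] for i in range(depth)]
    tokens.append(hex_digest[depth * width:])
    return [token for token in tokens if token]


def content_address(digests, default_algorithm="sha256", depth=3, width=2):
    """Relative path of an object, derived from its content identifier."""
    return "/".join(shard(digests[default_algorithm], depth, width))


digests = hex_digests(io.BytesIO(b"hello"))
relative_path = content_address(digests)  # three 2-char directories, then the rest
```

Because the address depends only on the bytes of the object, two pids that point to identical content resolve to the same file, which is why the pid-to-object link has to be recorded separately.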

@@ -61,7 +64,36 @@ def store_object(

Returns:
object_metadata (ObjectMetadata): Object that contains the permanent address,
- file size, duplicate file boolean and hex digest dictionary.
+ file size and hex digest dictionary.
"""
raise NotImplementedError()
+
+ @abstractmethod
+ def tag_object(self, pid, cid):
+ """The `tag_object` method creates references that allow objects stored in HashStore
+ to be discoverable. Retrieving, deleting or calculating a hex digest of an object is
+ based on a pid argument; and to proceed, we must be able to find the object associated
+ with the pid.
+
+ Args:
+ pid (string): Authority-based or persistent identifier of object
+ cid (string): Content identifier of object
+
+ Returns:
+ boolean: `True` upon successful tagging.
+ """
+ raise NotImplementedError()
+
+ @abstractmethod
+ def find_object(self, pid):
+ """The `find_object` method checks whether an object referenced by a pid exists
+ and returns the content identifier.
+
+ Args:
+ pid (string): Authority-based or persistent identifier of object
+
+ Returns:
+ cid (string): Content identifier of the object
+ """
+ raise NotImplementedError()
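
Reviewer note: the contract of the two new abstract methods can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions. The commit messages indicate the real implementation persists references as files under a `refs` directory; the class name `InMemoryRefs`, the dictionaries, and the choice to raise `ValueError` for an already tagged or unknown pid are all hypothetical.

```python
class InMemoryRefs:
    """Toy stand-in for the pid/cid reference bookkeeping (not the real store)."""

    def __init__(self):
        self.pid_to_cid = {}   # one cid per pid
        self.cid_to_pids = {}  # many pids may reference the same cid

    def tag_object(self, pid, cid):
        """Record that `pid` refers to the object whose content identifier is `cid`."""
        if pid in self.pid_to_cid:
            # Assumption: re-tagging an existing pid is treated as an error.
            raise ValueError(f"pid already tagged: {pid}")
        self.pid_to_cid[pid] = cid
        self.cid_to_pids.setdefault(cid, set()).add(pid)
        return True

    def find_object(self, pid):
        """Return the cid referenced by `pid`, or raise if no reference exists."""
        try:
            return self.pid_to_cid[pid]
        except KeyError:
            raise ValueError(f"no object found for pid: {pid}") from None


refs = InMemoryRefs()
refs.tag_object("jtao.1700.1", "example-cid")
cid = refs.find_object("jtao.1700.1")
```

Keeping a reverse map from cid to pids is one way to decide when an object's bytes are no longer referenced and can be deleted; whether the store does it this way is not shown in the rendered diff.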

@@ -89,9 +121,8 @@ def store_metadata(self, pid, metadata, format_id):
@abstractmethod
def retrieve_object(self, pid):
"""The `retrieve_object` method retrieves an object from disk using a given
- persistent identifier (pid). If the object exists (determined by calculating
- the object's permanent address using the SHA-256 hash of the given pid), the
- method will open and return a buffered object stream ready to read from.
+ persistent identifier (pid). If the object exists, the method will open and return
+ a buffered object stream ready to read from.

Args:
pid (string): Authority-based identifier.
3 changes: 0 additions & 3 deletions tests/conftest.py
@@ -47,7 +47,6 @@ def init_pids():
test_pids = {
"doi:10.18739/A2901ZH2M": {
"file_size_bytes": 39993,
- "object_cid": "0d555ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e",
"metadata_cid": "323e0799524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7",
"md5": "db91c910a3202478c8def1071c54aae5",
"sha1": "1fe86e3c8043afa4c70857ca983d740ad8501ccd",
@@ -58,7 +57,6 @@ def init_pids():
},
"jtao.1700.1": {
"file_size_bytes": 8724,
- "object_cid": "a8241925740d5dcd719596639e780e0a090c9d55a5d0372b0eaf55ed711d4edf",
"metadata_cid": "ddf07952ef28efc099d10d8b682480f7d2da60015f5d8873b6e1ea75b4baf689",
"md5": "f4ea2d07db950873462a064937197b0f",
"sha1": "3d25436c4490b08a2646e283dada5c60e5c0539d",
@@ -69,7 +67,6 @@ def init_pids():
},
"urn:uuid:1b35d0a5-b17a-423b-a2ed-de2b18dc367a": {
"file_size_bytes": 18699,
- "object_cid": "7f5cc18f0b04e812a3b4c8f686ce34e6fec558804bf61e54b176742a7f6368d6",
"metadata_cid": "9a2e08c666b728e6cbd04d247b9e556df3de5b2ca49f7c5a24868eb27cddbff2",
"md5": "e1932fc75ca94de8b64f1d73dc898079",
"sha1": "c6d2a69a3f5adaf478ba796c114f57b990cf7ad1",
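
Reviewer note: the fixture fields that remain (`file_size_bytes` and the per-algorithm digests) are the kind of expected values that the `store_object` docstring says are checked before an object is moved to its permanent address. A minimal sketch of such a check follows; `verify_stream` is a hypothetical helper written for illustration and is not the `verify_object` method added in this pull request.

```python
import hashlib
import io


def verify_stream(stream, expected_size=None, checksum=None,
                  checksum_algorithm=None, chunk_size=8192):
    """Raise ValueError if the stream does not match the expected size/checksum."""
    if checksum is not None and checksum_algorithm is None:
        raise ValueError("checksum_algorithm is required when checksum is given")

    # hashlib.new raises ValueError for algorithms it does not recognize.
    hasher = hashlib.new(checksum_algorithm) if checksum_algorithm else None
    size = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        size += len(chunk)
        if hasher is not None:
            hasher.update(chunk)

    if expected_size is not None and size != expected_size:
        raise ValueError(f"size mismatch: expected {expected_size}, got {size}")
    if checksum is not None and hasher.hexdigest() != checksum.lower():
        raise ValueError(f"{checksum_algorithm} checksum mismatch")
    return size


data = b"example object bytes"
verify_stream(
    io.BytesIO(data),
    expected_size=len(data),
    checksum=hashlib.md5(data).hexdigest(),
    checksum_algorithm="md5",
)
```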