Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature-73: store_object Refactor (with References) #74

Merged
merged 71 commits into from
Dec 6, 2023

Conversation

doulikecookiedough
Copy link
Contributor

@doulikecookiedough doulikecookiedough commented Nov 21, 2023

This pull request represents the changes required for HashStore to integrate into Metacat - where a multipart request is used to upload an object, and it's respective parts (ex. the data object, form, metadata, etc.) can arrive in a different order with each request. If the data object comes first - we need to be able to store it without providing a pid. Currently, this is not possible as store_object requires a pid argument.

As a result, HashStore has been refactored to allow store_object to be called without supplying a pid. Additionally, objects are stored by their content identifiers (based on the HashStore default store algorithm). This is a switch back to our original proposed design, with the primary difference being the process in which we manage where the content identifier (cid) of the object is located/referenced so that it can be found. Previously, the cid was stored with the sysmeta (metadata document) of the object in the metadata directory. In this refactor, data objects and their respective references are managed via references files in the .../refs/pid/ and .../refs/cid/ folder.

  • Note: This switch back to using the cid as the permanent address was made to simplify the process of storing an object. This way, we do not need to store objects into temporary files, hold the name and then have a new commit process to move the object when it's "ready". Objects are stored once, and deleted when the client determines to do so.

A reference file for a pid is stored in .../refs/pid/ with the permanent address being the sharded (sha256) hash of the pid, and contains the cid of the object it references. A pid ref file can only contain one cid. A reference file for a cid is stored in .../refs/cid/ with the permanent address being the sharded cid itself, and the contents being a list of pids delimited with new lines (\n). So to find an object, you would call find_object(pid) which will return the cid (string). Deleting an object will delete its pid reference, and also remove it from its respective cid reference file.

  • Note: We discussed having an "exists" Public API method which is now the intention of find_object
  • Note: The process to tag objects and delete objects are synchronized based on the cid to prevent accidental deletions.
    • An object cannot be deleted until its cid reference file is empty, and likewise with the cid ref file itself.
    • Calling delete_object(pid) will first remove its pid from the cid reference file, delete the cid_reference_file if its empty, then delete its pid reference file and lastly, the object itself only if the cid_reference_file was successfully deleted.

In conclusion, there will be two paths to store an object:

  1. Data comes first - store_object(pid=None, data) with just the data
    • This will store the object into HashStore with its cid being the permanent address
    • The client will then have to separately verify the object when it receives the form-data with the checksum and checksum algorithm (via verify_object(object_metadata, checksum, checksum_algorithm, expected_file_size))
    • If no exceptions are thrown, the client finalizes the process by calling tag_object(pid, cid)
      object_metadata = store.store_object(pid=None, data)
      store.verify_object(object_metadata, checksum, checksum_algorithm, expected_file_size)
      store.tag_object(pid, cid)
      
  2. Form comes first, we know the pid - store_object(pid, data, ...)
    • Calling store_object with the pid (and relevant additional parameters) will not only store the object, but also tag and verify the object. This is an all-in-one method if we receive the form data before the object.
      object_metadata = store.store_object(pid, data, add_algo, checksum, checksum_algo, file_size)
      

Summary:

  • Objects are stored using their cid as the permanent address
  • store_object has been refactored to allow for storing data only.
  • There is a new .../refs directory which houses the /cid/.. and /pid/.. references along with the supporting methods and tests to facilitate the tagging process.
  • There are two new Public API methods (maybe a third): tag_object, find_object
    • I have left out verify_object, but after describing this pull request, feels like it should be added. I would like to get some feedback here to confirm its inclusion.
      • If Metacat relies on this method to complete a process, all further implementations of potential HashStores should also have it. The main reason why I didn't include it is because I thought I might be adding too much to the Public API.
    • delete_object, retrieve_object and get_hex_digest Public API methods have also been updated to reflect the recent changes
  • Various updates to the docstrings, comments and method names to improve clarity and accuracy
# Example layout in HashStore with a single file stored along with its metadata and reference files
## Notes:
## - The reference for the pids contains the cid
## - The reference for the cids contain the pids that reference the cid

/objects/
    └─ d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
/metadata/
    └─ 15/8d/7e/55c36a810d7c14479c9...b20d7df66768b04
/refs/
    └─ pid/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e
    └─ cid/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
hashstore.yaml

@iannesbitt and @artntek - Could you guys please help review this pull request when you have scope?
@mbjones - If you have time, I would appreciate some feedback as well.

…ests, and '_store_data' to 'store_data_only'
…efs' and add new empty method '_write_cid_reference'
@iannesbitt
Copy link

@doulikecookiedough This looks good, everything looks really solid from a Python perspective. After testing and poking around for an hour I don't see anything major that needs to change for this PR. Two observations:

  1. Some of the function names get quite long, but I appreciate that they are descriptive.
  2. Completing Convert docstrings to reStructuredText  #70 and adding type hints to your function definitions would make the code more readable, but it seems like you are working up to that point.

For good measure I am adding myself as a reviewer and approving.

@iannesbitt iannesbitt self-requested a review November 22, 2023 21:53
Copy link

@iannesbitt iannesbitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/hashstore/filehashstore.py Show resolved Hide resolved
"""
if (
not isinstance(data, str)
and not isinstance(data, Path)
and not isinstance(data, io.BufferedIOBase)
):
exception_string = (
"FileHashStore - store_object: Data must be a path, string or buffered"
"FileHashStore - _validate_arg_data: Data must be a path, string or buffered"
+ f" stream type. Data type supplied: {type(data)}"
)
logging.error(exception_string)
raise TypeError(exception_string)
if isinstance(data, str):
if data.replace(" ", "") == "":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why repeat this code here, when you already have a function that does this?

Copy link
Contributor Author

@doulikecookiedough doulikecookiedough Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github review isn't playing nicely with me, but I found a redundant method _validate_arg_metadata which has been deleted. store_metadata now calls _validate_arg_data instead.

src/hashstore/filehashstore.py Outdated Show resolved Hide resolved
src/hashstore/filehashstore.py Show resolved Hide resolved
src/hashstore/filehashstore.py Outdated Show resolved Hide resolved
@@ -1079,41 +1583,7 @@ def _validate_algorithms_and_checksum(
checksum_algorithm_checked = self.clean_algorithm(checksum_algorithm)
return additional_algorithm_checked, checksum_algorithm_checked

def _refine_algorithm_list(self, additional_algorithm, checksum_algorithm):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines 1574 and 1580, above:

"validate_checksum_args (store_object)", should be _validate_arg_algorithms_and_checksum ?

Is there not a better way to get the calling method's name for logging purposes automatically, instead of relying on the developer to remember to update the string (which is already not working well 😁)?

Copy link
Contributor Author

@doulikecookiedough doulikecookiedough Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a way to do this in the python standard library's inspect module - however, it isn't exactly "plug and play" (example here). While the current solution isn't very elegant (and relies on the developer's attention level...), I think it is acceptable for now. But if you think it's still better for us to address it, please let me know and I'll create a separate issue! Thank you

src/hashstore/filehashstore.py Outdated Show resolved Hide resolved
src/hashstore/filehashstore.py Show resolved Hide resolved
src/hashstore/filehashstore.py Show resolved Hide resolved
@doulikecookiedough
Copy link
Contributor Author

Thank you @artntek and @iannesbitt for reviewing my pull request. I believe I have addressed all the feedback and am proceeding to merge into develop. If there's anything you want me to review further, please let me know and I'll open a new issue to discuss.

@doulikecookiedough doulikecookiedough merged commit 059d3b9 into develop Dec 6, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants