Feature-73: `store_object` Refactor (with References) #74

doulikecookiedough · 2023-11-21T23:14:01Z

This pull request contains the changes required for HashStore to integrate with Metacat, where a multipart request is used to upload an object and its respective parts (e.g. the data object, form, metadata) can arrive in a different order with each request. If the data object comes first, we need to be able to store it without providing a pid. Currently, this is not possible because store_object requires a pid argument.

As a result, HashStore has been refactored to allow store_object to be called without supplying a pid. Additionally, objects are stored by their content identifiers (based on the HashStore default store algorithm). This is a switch back to our original proposed design, with the primary difference being how we track where the content identifier (cid) of an object is referenced so that the object can be found. Previously, the cid was stored with the sysmeta (metadata document) of the object in the metadata directory. In this refactor, data objects and their respective references are managed via reference files in the .../refs/pid/ and .../refs/cid/ directories.

Note: This switch back to using the cid as the permanent address was made to simplify the process of storing an object. This way, we do not need to store objects in temporary files, hold the name, and then run a separate commit process to move the object when it's "ready". Objects are stored once, and deleted when the client decides to do so.

A reference file for a pid is stored in .../refs/pid/ with the permanent address being the sharded (sha256) hash of the pid, and contains the cid of the object it references. A pid ref file can only contain one cid. A reference file for a cid is stored in .../refs/cid/ with the permanent address being the sharded cid itself, and its contents are a list of pids delimited by newlines (\n). To find an object, you call find_object(pid), which returns the cid (string). Deleting an object will delete its pid reference file, and also remove the pid from its respective cid reference file.
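
To make the reference layout concrete, here is a minimal standalone sketch. It is not the HashStore implementation: the helper names (`shard`, `pid_ref_path`, `cid_ref_path`, `sketch_tag_object`, `sketch_find_object`) are hypothetical, the shard depth of 3 and width of 2 are assumptions read off the example layout further down, and the UTF-8 encoding of the pid is also an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def shard(hex_digest, depth=3, width=2):
    """Split a hex digest into `depth` directories of `width` characters,
    followed by the remainder (assumed layout, e.g. d5/95/3b/d802...)."""
    parts = [hex_digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(hex_digest[depth * width:])
    return Path(*parts)


def pid_ref_path(root, pid):
    # The pid is hashed with sha256 first, then the digest is sharded.
    digest = hashlib.sha256(pid.encode("utf-8")).hexdigest()
    return Path(root) / "refs" / "pid" / shard(digest)


def cid_ref_path(root, cid):
    # The cid is already a hex digest, so it is sharded as-is.
    return Path(root) / "refs" / "cid" / shard(cid)


def sketch_tag_object(root, pid, cid):
    """Write the pid ref file (one cid) and append the pid to the cid ref file."""
    pid_ref = pid_ref_path(root, pid)
    pid_ref.parent.mkdir(parents=True, exist_ok=True)
    pid_ref.write_text(cid)  # a pid ref file holds exactly one cid

    cid_ref = cid_ref_path(root, cid)
    cid_ref.parent.mkdir(parents=True, exist_ok=True)
    with cid_ref.open("a") as ref_file:
        ref_file.write(pid + "\n")  # pids are newline-delimited


def sketch_find_object(root, pid):
    """Return the cid that a pid references, or raise if the pid is untagged."""
    pid_ref = pid_ref_path(root, pid)
    if not pid_ref.exists():
        raise FileNotFoundError(f"No pid reference file found for pid: {pid}")
    return pid_ref.read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store_root:
        example_cid = hashlib.sha256(b"example data").hexdigest()
        sketch_tag_object(store_root, "example.pid.1", example_cid)
        print(sketch_find_object(store_root, "example.pid.1") == example_cid)
```

The sketch does not guard against tagging the same pid twice or against concurrent writers; those concerns are out of scope for the illustration.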

Note: We discussed having an "exists" Public API method; find_object is now intended to serve that purpose.
Note: The processes to tag and delete objects are synchronized on the cid to prevent accidental deletions.
- An object cannot be deleted until its cid reference file is empty, and the cid reference file itself cannot be deleted until it is empty.
- Calling delete_object(pid) will first remove the pid from the cid reference file, delete the cid reference file if it is empty, then delete the pid reference file and, lastly, delete the object itself only if the cid reference file was successfully deleted.
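
The deletion order above can be sketched with an in-memory stand-in. This is only an illustration under stated assumptions: `RefsSketch` is a hypothetical class, dictionaries and sets replace the reference files and object store, and the per-cid `threading.Lock` is one possible way to synchronize on the cid, not necessarily how HashStore does it.

```python
import threading


class RefsSketch:
    """In-memory stand-in for the refs directories and the object store."""

    def __init__(self):
        self.objects = set()   # cids of stored objects
        self.pid_refs = {}     # pid -> cid (one cid per pid)
        self.cid_refs = {}     # cid -> set of pids referencing it
        self._cid_locks = {}   # cid -> lock (assumed synchronization scheme)
        self._locks_guard = threading.Lock()

    def _lock_for(self, cid):
        # Create the per-cid lock lazily, guarded so only one lock exists per cid.
        with self._locks_guard:
            return self._cid_locks.setdefault(cid, threading.Lock())

    def tag_object(self, pid, cid):
        with self._lock_for(cid):
            self.pid_refs[pid] = cid
            self.cid_refs.setdefault(cid, set()).add(pid)

    def delete_object(self, pid):
        """Return True if the object itself was removed, False otherwise."""
        cid = self.pid_refs[pid]  # raises KeyError for an unknown pid
        with self._lock_for(cid):
            # 1. Remove the pid from the cid reference.
            pids = self.cid_refs[cid]
            pids.discard(pid)
            # 2. Delete the cid reference only if it is now empty.
            cid_ref_deleted = False
            if not pids:
                del self.cid_refs[cid]
                cid_ref_deleted = True
            # 3. Delete the pid reference.
            del self.pid_refs[pid]
            # 4. Delete the object only if the cid reference was deleted.
            if cid_ref_deleted:
                self.objects.discard(cid)
            return cid_ref_deleted
```

With two pids tagged to the same cid, the first `delete_object` call leaves the object in place and only the second one removes it.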

In conclusion, there will be two paths to store an object:

Data comes first - store_object(pid=None, data) with just the data
- This will store the object into HashStore with its cid being the permanent address
- The client will then have to separately verify the object when it receives the form-data with the checksum and checksum algorithm (via verify_object(object_metadata, checksum, checksum_algorithm, expected_file_size))
- If no exceptions are thrown, the client finalizes the process by calling tag_object(pid, cid)
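```
# pid is None because the data object arrived before the pid was known
object_metadata = store.store_object(None, data)
# Raises an exception if the checksum or size does not match
store.verify_object(object_metadata, checksum, checksum_algorithm, expected_file_size)
# cid is the content identifier of the object stored above
store.tag_object(pid, cid)
```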
Form comes first, we know the pid - store_object(pid, data, ...)
- Calling store_object with the pid (and relevant additional parameters) will not only store the object, but also tag and verify the object. This is an all-in-one method if we receive the form data before the object.
```
object_metadata = store.store_object(pid, data, add_algo, checksum, checksum_algo, file_size)
```
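
For the data-first path, the separate verification step boils down to comparing a checksum and a size against what the client expects. The following is a rough sketch of that idea, not the actual verify_object: `verify_sketch` is a hypothetical function, and it recomputes the digest from a file on disk, whereas verify_object takes the object_metadata returned by store_object.

```python
import hashlib
import os
import tempfile


def verify_sketch(path, checksum, checksum_algorithm, expected_file_size):
    """Raise ValueError if the file's size or hex digest does not match."""
    digest = hashlib.new(checksum_algorithm)
    size = 0
    with open(path, "rb") as stream:
        # Read in chunks so large objects are not loaded into memory at once.
        for chunk in iter(lambda: stream.read(8192), b""):
            digest.update(chunk)
            size += len(chunk)
    if size != expected_file_size:
        raise ValueError(f"Size mismatch: expected {expected_file_size}, got {size}")
    if digest.hexdigest() != checksum.lower():
        raise ValueError(f"{checksum_algorithm} checksum mismatch")


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example data")
    try:
        expected = hashlib.sha256(b"example data").hexdigest()
        verify_sketch(tmp.name, expected, "sha256", len(b"example data"))
        print("verified")
    finally:
        os.remove(tmp.name)
```

If no exception is raised, the client would go on to call tag_object(pid, cid) as described in the data-first path.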

Summary:

- Objects are stored using their cid as the permanent address.
- store_object has been refactored to allow for storing data only.
- There is a new .../refs directory which houses the /cid/.. and /pid/.. references, along with the supporting methods and tests to facilitate the tagging process.
- There are two new Public API methods (maybe a third): tag_object and find_object.
  - I have left out verify_object but, after describing this pull request, it feels like it should be added. I would like to get some feedback here to confirm its inclusion.
    - If Metacat relies on this method to complete a process, all future HashStore implementations should also have it. The main reason I didn't include it is that I thought I might be adding too much to the Public API.
  - The delete_object, retrieve_object and get_hex_digest Public API methods have also been updated to reflect the recent changes.
- Various updates to the docstrings, comments and method names to improve clarity and accuracy.

# Example layout in HashStore with a single file stored along with its metadata and reference files
## Notes:
## - The reference file for a pid contains the cid
## - The reference file for a cid contains the pids that reference the cid

/objects/
    └─ d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
/metadata/
    └─ 15/8d/7e/55c36a810d7c14479c9...b20d7df66768b04
/refs/
    └─ pid/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e
    └─ cid/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
hashstore.yaml

@iannesbitt and @artntek - Could you guys please help review this pull request when you have scope?
@mbjones - If you have time, I would appreciate some feedback as well.

iannesbitt · 2023-11-22T21:53:32Z

@doulikecookiedough This looks good; everything looks really solid from a Python perspective. After testing and poking around for an hour, I don't see anything major that needs to change for this PR. Two observations:

- Some of the function names get quite long, but I appreciate that they are descriptive.
- Completing "Convert docstrings to reStructuredText" (#70) and adding type hints to your function definitions would make the code more readable, but it seems like you are working up to that point.

For good measure I am adding myself as a reviewer and approving.

iannesbitt

See #74 (comment)