Feature-73: `store_object` Refactor (with References) #74
…nd subdirectories
…ests, and '_store_data' to 'store_data_only'
…efs' and add new empty method '_write_cid_reference'
…and update pytests
… revise all pytests
@doulikecookiedough This looks good, everything looks really solid from a Python perspective. After testing and poking around for an hour I don't see anything major that needs to change for this PR. Two observations:
For good measure I am adding myself as a reviewer and approving.
See #74 (comment)

     """
     if (
         not isinstance(data, str)
         and not isinstance(data, Path)
         and not isinstance(data, io.BufferedIOBase)
     ):
         exception_string = (
    -        "FileHashStore - store_object: Data must be a path, string or buffered"
    +        "FileHashStore - _validate_arg_data: Data must be a path, string or buffered"
             + f" stream type. Data type supplied: {type(data)}"
         )
         logging.error(exception_string)
         raise TypeError(exception_string)
     if isinstance(data, str):
         if data.replace(" ", "") == "":
Why repeat this code here, when you already have a function that does this?
GitHub review isn't playing nicely with me, but I found a redundant method, `_validate_arg_metadata`, which has been deleted. `store_metadata` now calls `_validate_arg_data` instead.
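To illustrate the idea of routing every caller through one validator, here is a minimal sketch. It is not the code from this pull request: the standalone function `validate_arg_data`, its `caller` parameter, the use of `str.strip()` for the blank-string check, and the choice of `ValueError` are all assumptions made for the example.

```python
import io
from pathlib import Path


def validate_arg_data(data, caller="validate_arg_data"):
    """Raise if data is not a str, Path or buffered stream, or is a blank string.

    Hypothetical helper: the name, signature and exception types are
    assumptions for illustration, not HashStore's actual API.
    """
    if not isinstance(data, (str, Path, io.BufferedIOBase)):
        raise TypeError(
            f"{caller}: Data must be a path, string or buffered stream type."
            f" Data type supplied: {type(data)}"
        )
    # str.strip() removes spaces, tabs and newlines, so a string made up
    # only of whitespace is rejected here.
    if isinstance(data, str) and data.strip() == "":
        raise ValueError(f"{caller}: Data string cannot be empty.")
    return True
```

With a helper like this, both the object-storing and metadata-storing code paths can call the same function, so the type check and its error message only need to be maintained in one place.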
    @@ -1079,41 +1583,7 @@ def _validate_algorithms_and_checksum(
             checksum_algorithm_checked = self.clean_algorithm(checksum_algorithm)
             return additional_algorithm_checked, checksum_algorithm_checked

         def _refine_algorithm_list(self, additional_algorithm, checksum_algorithm):
Lines 1574 and 1580, above: `"validate_checksum_args (store_object)",` should be `_validate_arg_algorithms_and_checksum`?
Is there not a better way to get the calling method's name for logging purposes automatically, instead of relying on the developer to remember to update the string (which is already not working well 😁)?
There is a way to do this in the Python standard library's `inspect` module - however, it isn't exactly "plug and play" (example here). While the current solution isn't very elegant (and relies on the developer's attention level...), I think it is acceptable for now. But if you think it's still better for us to address it, please let me know and I'll create a separate issue! Thank you
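For reference, one way the `inspect` approach could look is sketched below. The helper `current_function_name` and the class `FileHashStoreSketch` are hypothetical names invented for this example; only `inspect.currentframe()`, the frame attribute `f_back`, and the code-object attribute `co_name` come from the standard library.

```python
import inspect
import logging


def current_function_name():
    """Return the name of the function that called this helper."""
    frame = inspect.currentframe()
    # currentframe() may return None on Python implementations without
    # stack frame support, so fall back to a placeholder in that case.
    if frame is None or frame.f_back is None:
        return "<unknown>"
    # f_back is the caller's frame; co_name is that function's name.
    return frame.f_back.f_code.co_name


class FileHashStoreSketch:
    """Hypothetical stand-in for the real class, for illustration only."""

    def _validate_arg_data(self, data):
        if not isinstance(data, str):
            # The method name is looked up at runtime instead of being
            # typed into the message by hand.
            message = (
                f"FileHashStoreSketch - {current_function_name()}: "
                f"Data type supplied: {type(data)}"
            )
            logging.error(message)
            raise TypeError(message)
        return True
```

The trade-off is that the name reported is whichever function directly calls the helper, so wrapping it in another layer changes the result. That is part of why it is not quite "plug and play".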
…ace() to account for spaces, tabs and newline characters
…curacy and refactor accordingly
Thank you @artntek and @iannesbitt for reviewing my pull request. I believe I have addressed all the feedback and am proceeding to merge into develop. If there's anything you want me to review further, please let me know and I'll open a new issue to discuss.
This pull request represents the changes required for HashStore to integrate into Metacat, where a multipart request is used to upload an object and its respective parts (e.g. the data object, form, metadata, etc.) can arrive in a different order with each request. If the data object comes first, we need to be able to store it without providing a `pid`. Currently, this is not possible as `store_object` requires a `pid` argument.

As a result, HashStore has been refactored to allow `store_object` to be called without supplying a `pid`. Additionally, objects are stored by their content identifiers (based on the HashStore default store algorithm). This is a switch back to our original proposed design, with the primary difference being how we manage where the content identifier (`cid`) of the object is located/referenced so that it can be found. Previously, the `cid` was stored with the sysmeta (metadata document) of the object in the metadata directory. In this refactor, data objects and their respective references are managed via reference files in the `.../refs/pid/` and `.../refs/cid/` folders.

The decision to use the `cid` as the permanent address was made to simplify the process of storing an object. This way, we do not need to store objects into temporary files, hold the name and then have a new commit process to move the object when it's "ready". Objects are stored once, and deleted when the client determines to do so.

A reference file for a `pid` is stored in `.../refs/pid/` with the permanent address being the sharded (sha256) hash of the `pid`, and contains the `cid` of the object it references. A `pid` ref file can only contain one `cid`. A reference file for a `cid` is stored in `.../refs/cid/` with the permanent address being the sharded `cid` itself, and the contents being a list of pids delimited with new lines (`\n`). So to find an object, you would call `find_object(pid)`, which will return the `cid` (string). Deleting an object will delete its `pid` reference, and also remove it from its respective `cid` reference file.

- `find_object` … `cid` to prevent accidental deletions.
- … `cid` reference file is empty, and likewise with the `cid` ref file itself.
- `delete_object(pid)` will first remove its `pid` from the `cid reference file`, delete the `cid_reference_file` if it is empty, then delete its `pid reference file` and lastly, the object itself only if the `cid_reference_file` was successfully deleted.

In conclusion, there will be two paths to store an object:

- `store_object(pid=None, data)` with just the data
  - … `cid` being the permanent address
  - … `verify_object(object_metadata, checksum, checksum_algorithm, expected_file_size)`)
  - … `tag_object(pid, cid)` … `pid`
- `store_object(pid, data, ...)`
  - `store_object` with the pid (and relevant additional parameters) will not only store the object, but also tag and verify the object. This is an all-in-one method if we receive the form data before the object.

Summary:

- … `cid` as the permanent address
- `store_object` has been refactored to allow for storing data only.
- … `.../refs` directory which houses the `/cid/..` and `/pid/..` references along with the supporting methods and tests to facilitate the tagging process.
- … `tag_object`, `find_object`
- … `verify_object`, but after describing this pull request, feels like it should be added. I would like to get some feedback here to confirm its inclusion.
- `delete_object`, `retrieve_object` and `get_hex_digest` Public API methods have also been updated to reflect the recent changes.

@iannesbitt and @artntek - Could you guys please help review this pull request when you have scope?
@mbjones - If you have time, I would appreciate some feedback as well.
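The reference-file layout and deletion order described in this pull request can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not HashStore's implementation: the free functions taking a `root` argument, the `objects` directory name, the shard depth of 3 and width of 2, and the choice to raise when a `pid` is tagged twice are all hypothetical. Only the `refs/pid` and `refs/cid` folders, the sha256 hash of the `pid`, the one-`cid`-per-`pid` rule and the newline-delimited pid list come from the description above.

```python
import hashlib
from pathlib import Path


def shard(hex_digest, depth=3, width=2):
    """Split a hex digest into a relative path, e.g. 'abcdef12' -> ab/cd/ef/12.

    The depth and width values are assumptions for this sketch.
    """
    parts = [hex_digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(hex_digest[depth * width:])
    return Path(*parts)


def object_path(root, cid):
    # Assumed location: the object lives at the sharded cid under 'objects'.
    return Path(root) / "objects" / shard(cid)


def pid_ref_path(root, pid):
    # The pid ref file's address is the sharded sha256 hash of the pid.
    pid_hash = hashlib.sha256(pid.encode("utf-8")).hexdigest()
    return Path(root) / "refs" / "pid" / shard(pid_hash)


def cid_ref_path(root, cid):
    # The cid ref file's address is the sharded cid itself.
    return Path(root) / "refs" / "cid" / shard(cid)


def tag_object(root, pid, cid):
    """Point pid at cid, and record pid in the cid's reference file."""
    pid_ref = pid_ref_path(root, pid)
    if pid_ref.exists():
        # A pid ref file can only contain one cid (raising is an assumption).
        raise FileExistsError(f"pid is already tagged: {pid}")
    pid_ref.parent.mkdir(parents=True, exist_ok=True)
    pid_ref.write_text(cid)
    cid_ref = cid_ref_path(root, cid)
    cid_ref.parent.mkdir(parents=True, exist_ok=True)
    with cid_ref.open("a") as handle:
        handle.write(pid + "\n")  # pids are delimited with new lines


def find_object(root, pid):
    """Return the cid (string) that the pid references."""
    return pid_ref_path(root, pid).read_text()


def delete_object(root, pid):
    """Follow the deletion order described above; return True if the object was removed."""
    cid = find_object(root, pid)
    cid_ref = cid_ref_path(root, cid)
    # 1. Remove the pid from the cid reference file.
    remaining = [p for p in cid_ref.read_text().splitlines() if p != pid]
    cid_ref_deleted = False
    if remaining:
        cid_ref.write_text("\n".join(remaining) + "\n")
    else:
        # 2. Delete the cid reference file if it is now empty.
        cid_ref.unlink()
        cid_ref_deleted = True
    # 3. Delete the pid reference file.
    pid_ref_path(root, pid).unlink()
    # 4. Delete the object only if the cid reference file was deleted.
    if cid_ref_deleted:
        object_path(root, cid).unlink()
    return cid_ref_deleted
```

In this sketch, two pids tagged to the same `cid` share one stored object, and the object is removed only when the last of them is deleted.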