Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document the challenge to implement RIA store cleanups #104

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

mih
Copy link
Member

@mih mih commented Sep 28, 2023

It is not sufficient to just deal with the most basic key-operations in the ORA remote. At least when we (know the we) are operating with a filesystem on the remote.

We need to figure out, how we want to tackle this. Ideally without making unnecessary assumptions about the particular capabilties of effective URL handlers.

It is not sufficient to just deal with the most basic key-operations
in the ORA remote. At least when we (know the we) are operating with
a filesystem on the remote.

We need to figure out, how we want to tackle this. Ideally without
making unnecessary assumptions about the particular capabilties of
effective URL handlers.
@codecov-commenter
Copy link

codecov-commenter commented Sep 28, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (18e5e3a) 92.09% compared to head (bca2735) 91.97%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
- Coverage   92.09%   91.97%   -0.13%     
==========================================
  Files          17       17              
  Lines         405      411       +6     
  Branches       32       32              
==========================================
+ Hits          373      378       +5     
- Misses         15       16       +1     
  Partials       17       17              
Files Coverage Δ
datalad_ria/tests/test_ora.py 100.00% <100.00%> (ø)
datalad_ria/ora_remote.py 73.52% <50.00%> (-1.48%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@christian-monch
Copy link

Thanks for the code, that is pretty much along the way what I did, but a lot more to the point. Wasn't sufficiently familiar with uncurl to do it as nicely.

As far as I can see, there is one thing missing. Unless I misinterpret the code in uncurl.py, there is no mechanism to ensure that checkpresent does not return True for a given key, until the key content is completely uploaded, i.e. transfer_store is done1. That has to be ensured across different git-annex-remote-ora2 instances. In fact, it should be ensured in uncurl-instances as well. Currently checkpresent is based on URLOperations.stat. While some UrlOperation-instances might not report a file to be present during the upload of that file, others, e.g. FileUrlOperation-instances, might report it to be present during upload.

Synchronization has to be performed between different processes that are potentially executed on different hosts (although the latter is probably quite unlikely). A possible solution to this problem would be to add an upload-marker to a temporary location on the destination. The upload-marker could be the existence of a file like upload-<key>. There are a number of challenges with regard to the precise URL for such a mark, especially with the templates that uncurl supports. A possible, but ugly, solution might be to require the specification of a temporary location when initializing an uncurl-remote.

In the case of ORA2Remote, the problem is simpler. Since we control the URL-rewriting template, we could define the following location to hold upload-marker:

f'{base_url}/{dsid[:3]}/{dsid[3:]}/transfer/upload-{annex_key}'

Then we place such a marker into the location, until the upload has been performed. The usual problems of removing the marker, if the git annex remote crashes without removing the marker apply.

Another approach might be possible: if the stat-method of the UrlOperations-instance supports reliable reporting of the size of the remotely stored object. In this case would stat the remote object to check whether it is completely present on the remote. This requires that the size is always contained within the annex key.

Footnotes

  1. "CHECKPRESENT Key [...] It's important that, while a key is being transferred to a remote, CHECKPRESENT not indicate it's present in the remote until all the data has been sent." in https://git-annex.branchable.com/design/external_special_remote_protocol/

@mih
Copy link
Member Author

mih commented Sep 28, 2023

Thanks @christian-monch ! I don't want to expand the discussion here much -- it is only somewhat related to this particualr PR. But here is a related issue:
datalad/datalad-next#454

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants