URL support functionality notes #3

Open
yarikoptic opened this issue Nov 8, 2019 · 11 comments

@yarikoptic
Collaborator

yarikoptic commented Nov 8, 2019

Here is the protocol exchange from running `git annex addurl` on an `s3://` URL, which is handled via the datalad special remote:
$> git annex initremote datalad type=external externaltype=datalad encryption=none 
...
$> git annex addurl --pathdepth=-1 --debug s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
...
[2019-11-08 10:50:13.671664748] git-annex-remote-datalad[1] <-- CLAIMURL s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:13.671791652] git-annex-remote-datalad[1] --> DEBUG Encodings: filesystem utf-8, default utf-8
[2019-11-08 10:50:13.67187777] Encodings: filesystem utf-8, default utf-8
[2019-11-08 10:50:13.672111287] git-annex-remote-datalad[1] --> DEBUG Claiming url 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt'
[2019-11-08 10:50:13.672210539] Claiming url 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt'
[2019-11-08 10:50:13.672270751] git-annex-remote-datalad[1] --> CLAIMURL-SUCCESS
[2019-11-08 10:50:13.67239921] git-annex-remote-datalad[1] <-- CHECKURL s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.040837149] git-annex-remote-datalad[1] --> CHECKURL-CONTENTS 4250 ds116/sub001/BOLD/task001_run001/QA/fd.txt
addurl s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt (from datalad) (to ds116/sub001/BOLD/task001_run001/QA/fd.txt) [2019-11-08 10:50:14.041350273] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2019-11-08 10:50:14.050977324] process done ExitSuccess
[2019-11-08 10:50:14.051141617] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2019-11-08 10:50:14.063291123] process done ExitSuccess
[2019-11-08 10:50:14.063796021] chat: git ["--git-dir=.git","--work-tree=.","check-ignore","-z","--stdin","--verbose","--non-matching"]

[2019-11-08 10:50:14.085449929] git-annex-remote-datalad[1] <-- TRANSFER RETRIEVE URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt .git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt
[2019-11-08 10:50:14.087122748] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt http:
[2019-11-08 10:50:14.090879728] git-annex-remote-datalad[1] <-- VALUE 
[2019-11-08 10:50:14.09178823] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt https:
[2019-11-08 10:50:14.094266045] git-annex-remote-datalad[1] <-- VALUE 
[2019-11-08 10:50:14.094895422] git-annex-remote-datalad[1] --> GETURLS URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt s3:
[2019-11-08 10:50:14.098633146] git-annex-remote-datalad[1] <-- VALUE s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.098871334] git-annex-remote-datalad[1] <-- VALUE 
[INFO] Downloading 's3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt' into '.git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt' 
[2019-11-08 10:50:14.245314399] git-annex-remote-datalad[1] --> PROGRESS 4250
100%  4.15 KiB         26 KiB/s 0s[2019-11-08 10:50:14.246005387] git-annex-remote-datalad[1] --> TRANSFER-SUCCESS RETRIEVE URL-s4250--s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt
[2019-11-08 10:50:14.24666711] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","check-attr","-z","--stdin","annex.backend","annex.numcopies","annex.largefiles","--"]
[INFO] Successfully downloaded s3://openfmri/ds116/sub001/BOLD/task001_run001/QA/fd.txt into .git/annex/tmp/URL-s4250--s3&c%%openfmri%ds116%sub001%BOLD%task001_run001%QA%fd.txt 
ok
(recording state in git...)
...

So the workflow would be

  • reply with CLAIMURL-SUCCESS to CLAIMURL for those URLs which start with globus://[<globus-name>|<globus-uuid>]/<fileprefix>, where <globus-name> (or <globus-uuid>) and <fileprefix> are options of the special remote
  • provide a CHECKURL-CONTENTS Size|UNKNOWN Filename response to the CHECKURL query from annex (for those matching URLs)
  • on TRANSFER RETRIEVE Key File we could analyze the provided Key.
    1. If Key is of the URL- backend (starts with URL-), we could avoid using GETURLS (as we did in datalad) and just parse that key to extract the URL and the corresponding path to be retrieved
    2. while transferring we need to send back PROGRESS reports
    3. the tricky part here is that for RETRIEVE we would also like to support regular/proper git annex special remote behavior, if data was stored in the layout of a regular special remote. So we might want to check first (specifics yet to be determined) whether a key is available as on a regular annex special remote; and if not, or if there is no information, run GETURLS, filter for the URLs we care about, and try to download using them instead
    4. instead of 1. + 3. -- maybe we should just do GETURLS regardless of the key, and only if that doesn't provide any URLs we handle, fall back to the "regular special remote" way.
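
To make the first steps concrete, here is a minimal Python sketch of the URL claiming, the CHECKURL reply, and the key parsing from item 1. It is an illustration under assumptions, not the remote's actual code: the function names (`claims_url`, `checkurl_reply`, `parse_url_key`) are hypothetical, and the `URL-s<size>--<url>` key layout is inferred only from the debug log above. Since git-annex decides how keys are built, the string recovered from a key may not always be a usable URL, which is one more argument for option 4 (always asking GETURLS).

```python
import re
from typing import Optional, Tuple

GLOBUS_SCHEME = "globus://"

# Key layout as seen in the debug log above: URL-s<size>--<url>.
# The size field is treated as optional (assumption).
_URL_KEY = re.compile(r"^URL(?:-s(?P<size>\d+))?--(?P<url>.+)$")


def claims_url(url: str, endpoint: str, fileprefix: str = "") -> bool:
    """True if url starts with globus://<endpoint>/<fileprefix>.

    endpoint stands for the globus name or uuid option of the remote.
    """
    base = f"{GLOBUS_SCHEME}{endpoint}/{fileprefix.lstrip('/')}"
    return url.startswith(base)


def checkurl_reply(size: Optional[int], filename: str) -> str:
    """Format the CHECKURL-CONTENTS Size|UNKNOWN Filename response line."""
    size_field = "UNKNOWN" if size is None else str(size)
    return f"CHECKURL-CONTENTS {size_field} {filename}"


def parse_url_key(key: str) -> Optional[Tuple[Optional[int], str]]:
    """Return (size, url-ish string) for a URL- key, or None for other backends."""
    match = _URL_KEY.match(key)
    if match is None:
        return None
    size = int(match.group("size")) if match.group("size") else None
    return size, match.group("url")
```

A caller would try `parse_url_key` first and fall back to GETURLS whenever it returns `None` or the recovered string is not claimed by `claims_url`.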

Overall summary -- we should be able to make it work as a proper git annex external special remote with GET/PUT and EXPORT, while also supporting the regular annex addurl globus://... functionality (thus datalad addurls could be used to establish an "import" of already existing directories on globus, until git annex provides protocol/support for a proper "import").

BUT I believe that git-annex itself is the one which "registers the url" for the key upon addurl URL (thus storing the ad-hoc globus:// url in the git-annex branch, in the .web file for the key). Ideally we should avoid storing that url, and rather store just the path to the file (and version info), assuming the globus://[<globus-name>|<globus-uuid>]/<fileprefix> prefix. That would allow for more flexible management (e.g. renaming the globus endpoint, or renaming/moving fileprefix) and minimize storage within the git-annex branch. We might need to clarify that with @joeyh (Q: is it possible for a special remote to announce that a claimed url shouldn't be stored as a url for the file?)

@joeyh

joeyh commented Nov 11, 2019 via email

@gi114
Collaborator

gi114 commented Nov 15, 2019

Hi,
Here is a snapshot of what addurl is doing at the moment. CLAIMURL recognizes the 'artificial' globus:// protocol that we discussed for Globus. It seems, though, that the process does not yet save the .git/annex/tmp/tmp_url in the git annex branch, as it should before GETURLS is triggered.
I am working on that now and trying to find a way to do it, so let me know if you have any suggestions on how to achieve that, or any modifications you think I should make.

Screenshot from 2019-11-14 15-36-58

@gi114
Collaborator

gi114 commented Nov 15, 2019

Actually, here GETURLS does find a 'globus://' url:

[2019-11-15 10:17:46.923324113] git-annex-remote-globus[1] --> GETURLS URL-s572--globus://8ca92f91-39fb-4176-bcb-c3b94a808a2c79d140e7725fef792609 globus://
[2019-11-15 10:17:46.925082699] git-annex-remote-globus[1] <-- VALUE globus://8ca92f91-39fb-4176-bcb9-7fb1ed53114b/5/published/publication_170/submitted_data/2015_11_18_cortex/mask/mask.mat
[2019-11-15 10:17:46.925231767] git-annex-remote-globus[1] <-- VALUE 
[2019-11-15 10:17:46.925513495] git-annex-remote-globus[1] --> TRANSFER-SUCCESS RETRIEVE URL-s572--globus://8ca92f91-39fb-4176-bcb-c3b94a808a2c79d140e7725fef792609
[2019-11-15 10:17:46.931256428] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","check-attr","-z","--stdin","annex.backend","annex.numcopies","annex.largefiles","--"]
[2019-11-15 10:17:46.932642051] read: git ["--version"]
[2019-11-15 10:17:46.938621795] process done ExitSuccess

git-annex: .git/annex/tmp/URL-s572--globus&c%%8ca92f91-39fb-4176-bcb-c3b94a808a2c79d140e7725fef792609: getFileStatus: does not exist (No such file or directory)
failed

Nevertheless, I am missing the step where this url is downloaded into the tmp_url file, which is supposed to be stored in .git/annex/tmp.

@gi114
Collaborator

gi114 commented Nov 15, 2019

For some reason the file type does not get propagated:
.git/annex/tmp/URL-s572--globus&c%%8ca92f91-39fb-4176-bcb-c3b94a808a2c79d140e7725fef792609: getFileStatus: does not exist (No such file or directory)

Which should be: .git/annex/tmp/URL-s572--globus&c%%8ca92f91-39fb-4176-bcb-c3b94a808a2c79d140e7725fef792609.mat

The right approach is to save the actual file content into this location, am I right? So the content of globus://path/to/file should be saved in this location?

If yes, what is the use of it?

@yarikoptic
Collaborator Author

As we talked about during the weekly jitsi meeting -- just save the file under the filename git annex requested (i.e. .git/annex/tmp/URL-s572--globus&c%%8ca92f91-39fb-4176-bcb-c3b94a808a2c79d140e7725fef792609); there is no need to adjust anything. git-annex will later rename/move that file to the location corresponding to the key's backend (if it was not a URL backend), and add the extension of the original file if needed (i.e. for backends ending with E).
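
A minimal Python sketch of that advice, assuming the download is available as a readable binary stream: the content is written to exactly the path received with TRANSFER RETRIEVE, with no extension appended, and the cumulative byte count is passed to a callback that would emit PROGRESS. The names `save_to_requested_path` and `report_progress` are hypothetical and not part of any library.

```python
from typing import BinaryIO, Callable


def save_to_requested_path(
    source: BinaryIO,
    requested_path: str,
    report_progress: Callable[[int], None],
    chunk_size: int = 1 << 20,
) -> int:
    """Copy source into requested_path verbatim and return the bytes written.

    requested_path is used exactly as given: no extension is added and the
    file is not placed anywhere else.
    """
    written = 0
    with open(requested_path, "wb") as dest:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            dest.write(chunk)
            written += len(chunk)
            # Report the cumulative number of bytes written so far.
            report_progress(written)
    return written
```

Reporting TRANSFER-SUCCESS only after this function returns ensures the file exists at the requested path by the time git-annex looks for it.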

@gi114
Collaborator

gi114 commented Nov 21, 2019

Hi,

Thank you for your message. I tried it, but it still returns a failure at the very end:


  Downloading globus://8ca92f91-39fb-4176-bcb9-7fb1ed53114b/5/published/publication_170/submitted_data/2015_11_18_cortex/processed/2015_11_18_6_filtered0.1to10.mat into .git/annex/tmp/URL-s583568208--globus&c%%8ca92f91-39fb-4176-bcb-b4c7b06468f56af2925c0925ef087e9c
[2019-11-21 15:30:28.560758319] git-annex-remote-globus[1] --> PROGRESS 1
0%    1 B                 0 B/s[2019-11-21 15:30:28.561328338] git-annex-remote-globus[1] --> INFO Successfully downloaded globus://8ca92f91-39fb-4176-bcb9-7fb1ed53114b/5/published/publication_170/submitted_data/2015_11_18_cortex/processed/2015_11_18_6_filtered0.1to10.mat into .git/annex/tmp/URL-s583568208--globus&c%%8ca92f91-39fb-4176-bcb-b4c7b06468f56af2925c0925ef087e9c

  Successfully downloaded globus://8ca92f91-39fb-4176-bcb9-7fb1ed53114b/5/published/publication_170/submitted_data/2015_11_18_cortex/processed/2015_11_18_6_filtered0.1to10.mat into .git/annex/tmp/URL-s583568208--globus&c%%8ca92f91-39fb-4176-bcb-b4c7b06468f56af2925c0925ef087e9c
[2019-11-21 15:30:28.561719101] git-annex-remote-globus[1] --> TRANSFER-SUCCESS RETRIEVE URL-s583568208--globus://8ca92f91-39fb-4176-bcb-b4c7b06468f56af2925c0925ef087e9c
[2019-11-21 15:30:28.562750563] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","check-attr","-z","--stdin","annex.backend","annex.numcopies","annex.largefiles","--"]
[2019-11-21 15:30:28.563546251] read: git ["--version"]
[2019-11-21 15:30:28.568876865] process done ExitSuccess
ok
(recording state in git...)
[2019-11-21 15:30:29.55679333] feed: xargs ["-0","git","--git-dir=.git","--work-tree=.","--literal-pathspecs","add","--"]
[2019-11-21 15:30:29.559743577] process done ExitSuccess
[2019-11-21 15:30:29.605703139] process done ExitSuccess
[2019-11-21 15:30:29.606221091] process done ExitSuccess
[2019-11-21 15:30:29.606528966] process done ExitSuccess
[2019-11-21 15:30:29.606838115] process done ExitSuccess
[2019-11-21 15:30:29.607139805] process done ExitFailure 1

And the file content is put in my folder under the following name:
8ca92f91_39fb_4176_bcb9_7fb1ed53114b_5_published_publication_170_submitted_data_2015_11_18_cortex_processed_2015_11_18_6_filtered0.1to10.mat

What do you think about this?

@joeyh

joeyh commented Nov 22, 2019 via email

@gi114
Collaborator

gi114 commented Nov 22, 2019

Hi,

That is the only failure message. I posted everything it logged until the end, so there is no more information.

It does download the file, but it places it in the local directory from which I launch the command, even if I pass the .git/annex/tmp/URL-tmp-key location I receive from annex. My guess is that if the download target has no file extension (for example .git/annex/tmp/URL-tmp-key.txt to download a globus .txt file into), annex does not recognize it, logs a failure, and the download ends up in the current local directory. Nevertheless, I should not modify the tmp key; it should be fine if the file extension is not provided, as discussed with @yarikoptic.

I am going to investigate further. This is the only missing step; all other remote components are done. We can start thinking about whether we want to export data between endpoints, or add some other functionality.

@joeyh

joeyh commented Nov 24, 2019 via email

@gi114
Collaborator

gi114 commented Nov 28, 2019

Hi @yarikoptic, I am still concerned about the checkpresent operation, which is mandatory, for this reason: checkpresent(key) should return True if the file corresponding to that key has been stored via transfer_store(key, filename).

The google-drive remote does that: in transfer_store it creates the file to be uploaded, names it by the key, and stores the content there, so that it can later query the file by the key when checkpresent(key) is called. In fact, it would then do _get_file(key) and check that the file exists.

In my case I cannot add anything to globus, and checkpresent is an independent call, so I cannot rely on a cache, as it gets cleared at every call. I can ask globus for the file, just like _get_keys, but globus does not know what that key is or where the corresponding file is, because we skipped the transfer_store step. As for checking the size: yes, the key has the size in it, but again, the missing information is where the file corresponding to that key is. This is why I implemented looking into git-annex:lower_hash/key.log.web, which stores the path/to/file corresponding to that key.

I think the best thing is, given the key, to look in the git annex branch, where we have lower_hash/key.log.web. This file has the globus path that was added via addurl, i.e. globus://id/path/to/file, so I can ask globus about that path and check its size and whether it is present. Nevertheless, addurl already checks that there is a file in globus corresponding to that globus:// url, because of the claimurl and checkurl calls, and if there is, the .web file is generated. Therefore I think I can check the size to make sure nothing has changed! What do you think?

Let me know, thanks again!

@gi114
Collaborator

gi114 commented Nov 28, 2019

So no, what I will do in checkpresent(key) is call GETURLS with the key. If I do not get a globus:// url back, it returns False. If I do, which is the case when I added the url, I check that the size of the remote file matches the size recorded in my key, to make sure nothing has changed, and return True if it does. Would you agree with that? It is clean.
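
A minimal Python sketch of that idea, under stated assumptions: `get_urls` stands in for however the remote issues GETURLS with a `globus://` prefix, and `remote_size` stands in for a Globus lookup that returns the file size or `None` if the file is missing. Both callables, and the names `size_from_key` and `check_present`, are hypothetical. The `-s<size>` field is read from the key as it appears in the logs above.

```python
from typing import Callable, Iterable, Optional


def size_from_key(key: str) -> Optional[int]:
    """Extract the s<size> field from a key such as URL-s572--globus://..."""
    header = key.split("--", 1)[0]
    for field in header.split("-")[1:]:
        if field.startswith("s") and field[1:].isdigit():
            return int(field[1:])
    return None


def check_present(
    key: str,
    get_urls: Callable[[str, str], Iterable[str]],
    remote_size: Callable[[str], Optional[int]],
) -> bool:
    """True if a globus:// URL is registered for key and its size still matches."""
    expected = size_from_key(key)
    for url in get_urls(key, "globus://"):
        actual = remote_size(url)
        if actual is None:
            continue  # this URL no longer resolves to a file
        if expected is None or actual == expected:
            return True
    return False
```

One design point to consider: if the Globus lookup itself fails (network or authentication error), it is probably safer to let that error propagate so the remote can answer CHECKPRESENT-UNKNOWN, instead of returning False and making git-annex believe the content is gone.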
