Add clientpath to Filesets #12

Merged
merged 4 commits into from
Oct 31, 2023

Conversation

will-moore
Member

@will-moore will-moore commented Oct 11, 2023

Since existing FilesetEntry.clientpath values are set to unknown for mkngff Filesets, and we also don't have any reference to the original source of the data, we can set this value to something more useful.

This PR adds a --clientpath option, which takes a path or URL for the Fileset, e.g. https://s3-server/bucket/data.zarr, corresponding to the mounted S3 Fileset /dir/path/to/data.zarr.
With this base value, a clientpath can be created for every file found under the mounted Fileset.

E.g.

$ omero mkngff sql 4053141 --clientpath=https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr --secret=$SECRET /bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr > 4053141.sql

This creates SQL output with a 4th item, the clientpath, in each row. If the --clientpath option is omitted, the placeholder unknown is written as the 4th item of each row instead, which gives the same outcome as before.
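To illustrate the mapping described above, here is a minimal Python sketch of how a clientpath could be derived for each file under a mounted Fileset. It is not the code in this PR: the function name `iter_clientpaths` and its arguments are hypothetical, and the only behaviour taken from the description is that the base value is extended with each file's relative path, with `unknown` used when no base is given.

```python
import os


def iter_clientpaths(fileset_root, clientpath=None):
    """Yield (relative_path, clientpath_value) for each file under fileset_root.

    Hypothetical helper: `clientpath` plays the role of the --clientpath
    value; when it is missing, the placeholder "unknown" is used.
    """
    base = clientpath.rstrip("/") if clientpath else None
    for dirpath, dirnames, filenames in os.walk(fileset_root):
        dirnames.sort()  # sort in place so the traversal order is stable
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), fileset_root)
            rel = rel.replace(os.sep, "/")  # URLs always use forward slashes
            yield rel, (f"{base}/{rel}" if base else "unknown")
```

Called as `iter_clientpaths("/dir/path/to/data.zarr", "https://s3-server/bucket/data.zarr")`, this would pair each file found under the mounted directory with a URL under the base; no check is made that the URL exists.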

Tested at IDR/idr-utils#56 (comment)

@will-moore
Member Author

idr0004,Screen:202,S-BIAD867
idr0010,Screen:1351,S-BIAD885
idr0011,Screen:1501,S-BIAD866
idr0011,Screen:1551,S-BIAD866
idr0011,Screen:1601,S-BIAD866
idr0011,Screen:1602,S-BIAD866
idr0011,Screen:1603,S-BIAD866
idr0012,Screen:1202,S-BIAD845
idr0013,Screen:1101,S-BIAD865
idr0013,Screen:1302,S-BIAD865
idr0015,Screen:1201,S-BIAD861
idr0016,Screen:1251,S-BIAD851
idr0025,Screen:1851,S-BIAD846
idr0026,Project:301,S-BIAD860
idr0033,Screen:1751,S-BIAD848
idr0035,Screen:2001,S-BIAD847
idr0036,Screen:1952,S-BIAD855
idr0051,Project:552,S-BIAD815
idr0054,Project:701,S-BIAD800
idr0090,Screen:2851,S-BIAD882
idr0091,Dataset:1351,S-BIAD852

pip install 'omero-mkngff @ git+https://github.com/will-moore/omero-mkngff@clientpath'
# 1 plate from idr0004
omero mkngff clientpath Plate:1751 https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/

# all of idr0004
omero mkngff clientpath Screen:202 https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/

# loop over the csv above (saved as ngff_filesets.csv; columns: study,target,BioStudies accession)
for r in $(cat ngff_filesets.csv); do
  target=$(echo "$r" | cut -d',' -f2)
  biad=$(echo "$r" | cut -d',' -f3)
  echo "$target"
  omero mkngff clientpath "$target" "https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/$biad/"
done

@will-moore
Member Author

After running for nearly 8 hours, we are 14 plates into idr0012 (approx. 400 plates done), so it will be at least another day before this is complete!
This seems the wrong way to go when we've only just generated the Filesets.

@joshmoore I wonder if we could teach the SQL function mkngff_fileset() to populate the clientpath as in the description above? The trouble is that we don't want to regenerate all the sql files from scratch, although we could add the base URL for a Fileset, e.g. https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/103d9428-b86b-4f4e-84d8-966b5d89aae1/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr, to the parameter list.

Then, for each row in the array, e.g.

['demo_2/2015-10/01/07-25-30.185_mkngff/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr/A/10/0/3/', '.zarray', 'application/octet-stream'],

we'd need to be able to generate the clientpath within the mkngff_fileset() function, possibly by splitting the path on .zarr to get the relative path A/10/0/3/, then appending that and the name to the base URL to get, e.g.:
https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/103d9428-b86b-4f4e-84d8-966b5d89aae1/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr/A/10/0/3/.zarray
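The string logic proposed here can be sketched in Python to make the intended result concrete. This is only an illustration of the split-and-join step, not SQL and not code from mkngff_fileset(); the function name `clientpath_from_row` is hypothetical, and it assumes the first `.zarr/` in the row's path marks the root of the Fileset.

```python
def clientpath_from_row(base_url, path, name):
    """Build a clientpath from a Fileset base URL and one (path, name) row.

    Assumption: everything after the first ".zarr/" in `path` is the
    location of the file relative to the root of the zarr.
    """
    _, marker, relative = path.partition(".zarr/")
    if not marker:
        raise ValueError(f"no '.zarr/' found in path: {path!r}")
    # `relative` is either empty or already ends with "/", so name follows directly
    return f"{base_url.rstrip('/')}/{relative}{name}"


base = (
    "https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/"
    "103d9428-b86b-4f4e-84d8-966b5d89aae1/"
    "103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr"
)
row_path = (
    "demo_2/2015-10/01/07-25-30.185_mkngff/"
    "103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr/A/10/0/3/"
)
url = clientpath_from_row(base, row_path, ".zarray")
```

With the example row above, `url` is the base URL followed by `/A/10/0/3/.zarray`. A path nested inside a second `.zarr` directory would still split at the first one, which may or may not be what is wanted.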

Is that possible within sql language?

@will-moore
Member Author

will-moore commented Oct 13, 2023

Still running...

Fileset 6312826
tosave 3061
Fileset 6312697
tosave 3061

This is taking 3 minutes per Fileset just now....

@will-moore
Member Author

get_filesets Screen:1251
Fileset 6313488
tosave 14610

@joshmoore
Member

Is that possible within sql language?

I'm not sure I fully understand, but in general you can do anything with SQL, if slightly more verbosely.

I like your idea of templating the output, but there would still need to be checks for the existence of the files, no?

@will-moore
Member Author

Having experimented with doing this in the mkngff_fileset() function within the setup.sql script, I have given up. Instead, I'm going to simply pass the clientpath as a 4th item in each row that creates an OriginalFile.

This also means that we don't need the complex logic to resolve clientpath from path and name.

e.g.

$ omero mkngff sql 1591301 --clientpath="https://s3/path/to/image.zarr" /path/to/data/6001247.zarr

Found prefix: demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023 for fileset: 1591301

UPDATE pixels SET name = '.zattrs', path = 'demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr' where image in (select id from Image where fileset = 1591301);

begin;
    select mkngff_fileset(
      1591301,
      'SECRETUUID',
      'cdf35825-def1-4580-8d0b-9c349b8f78d6',
      'demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/',
      array[
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/', '.zattrs', 'application/octet-stream', 'https://s3/path/to/image.zarr/.zattrs'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/', '.zgroup', 'application/octet-stream', 'https://s3/path/to/image.zarr/.zgroup'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/0/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/0/.zarray'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/1/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/1/.zarray'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/2/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/2/.zarray']
      ]::text[][]
    );
commit;
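For readers who want to see how rows of this shape could be produced, the following Python sketch renders 4-item rows for the `array[...]` literal. It is an assumption-laden illustration rather than the generator in omero-mkngff: `sql_quote`, `format_row` and `format_rows` are hypothetical names, and doubling single quotes is simply the standard SQL way to escape a string literal.

```python
def sql_quote(value):
    """Return value as a single-quoted SQL string literal."""
    return "'" + value.replace("'", "''") + "'"


def format_row(path, name, mimetype, clientpath="unknown"):
    """Render one [path, name, mimetype, clientpath] array row."""
    items = (path, name, mimetype, clientpath)
    return "[" + ", ".join(sql_quote(item) for item in items) + "]"


def format_rows(rows, indent="          "):
    """Join rows with commas, one per line, ready to place inside array[...]."""
    return ",\n".join(indent + format_row(*row) for row in rows)


prefix = "demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/"
rows = [
    (prefix + "6001247.zarr/", ".zattrs", "application/octet-stream",
     "https://s3/path/to/image.zarr/.zattrs"),
    (prefix + "6001247.zarr/0/", ".zarray", "application/octet-stream",
     "https://s3/path/to/image.zarr/0/.zarray"),
]
body = format_rows(rows)
```

Because the clientpath is already resolved when the row is written, the SQL function only has to store the 4th item, which is the simplification described in this comment.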

@will-moore
Member Author

Tested at IDR/idr-utils#56 (comment)
with:

omero mkngff sql 4053141 --clientpath=https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr --secret=$SECRET /bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr > 4053141.sql

Re: @joshmoore "checks for the existence of the files": I'm not sure what you mean. In that example, the clientpath values point to files under https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr, but we don't check that those files exist.

@will-moore changed the title from "Add clientpath command" to "Add clientpath to Filesets" on Oct 24, 2023
@joshmoore merged commit e874ca3 into IDR:main on Oct 31, 2023
1 check failed