Batch export sower job #29
Conversation
describe_access_to_files_in_workspace_manifest(
    hostname, auth, MANIFEST_FILENAME
)
download_files_in_workspace_manifest(
FYI, just making a note here (since the DRS download Python SDK branch doesn't have a PR right now): I think the way the SDK currently implements this is to gather the access tokens from all related commons first, and then iterate through the manifest to download the files.
There is a possibility that, if the manifest is large or some downloads take longer than 20 minutes for whatever reason, some access tokens will expire, resulting in failures in subsequent downloads.
Probably not a big deal for now because we have relatively small file size limits.
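One way to avoid that expiry problem would be to fetch or refresh tokens lazily per download rather than gathering them all up front. A minimal sketch, assuming hypothetical get_wts_token and download_drs_object helpers (stand-ins, not the SDK's actual API):

import time

def download_with_refresh(manifest, output_dir, get_wts_token, download_drs_object):
    """Download each manifest entry, refreshing the per-commons access token
    near expiry instead of pre-caching every token up front.

    get_wts_token(commons) -> (token, expires_at_epoch_seconds)   # hypothetical
    download_drs_object(object_id, token, output_dir) -> None     # hypothetical
    """
    token_cache = {}  # commons hostname -> (token, expires_at)
    for entry in manifest:
        commons = entry["commons_url"]
        token, expires_at = token_cache.get(commons, (None, 0))
        # Refresh if we have no token for this commons or it expires within 60s.
        if token is None or expires_at - time.time() < 60:
            token, expires_at = get_wts_token(commons)
            token_cache[commons] = (token, expires_at)
        download_drs_object(entry["object_id"], token, output_dir)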
Yes, the pre-caching of WTS tokens is a known issue and needs to be addressed. I will highlight this in the related feature doc.
Note: this job replaces the (previously assumed) need for a separate file downloads API for the sake of HEAL. However, Sower's limitation on input data requires that the input fit in an environment variable, hence we send a list of study IDs rather than an entire manifest. If a manifest, DRS bundle, etc. is preferable for another use case, this job is easily extensible to use a user-specific S3 file as a job input instead. Batch-export cloud automation is set up with a bucket for this case.
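For reference, a sketch of how the job might read that input, assuming Sower exposes the request body to the container via an environment variable (INPUT_DATA here is an assumed name, not confirmed by this PR):

import json
import os

def get_study_ids():
    # INPUT_DATA is an assumed variable name for the JSON payload Sower passes in.
    payload = json.loads(os.environ.get("INPUT_DATA", "{}"))
    study_ids = payload.get("study_ids", [])
    if not study_ids:
        raise ValueError("No study_ids provided in job input")
    return study_ids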
for study in study_metadata:
    if not study["__manifest"]:
        print(
            f"Study {study['project_number']} is missing __manifest entry. Skipping."
Is the key configurable? Might it be project_number for some entries, but study_id for others?
Might be worth removing this key altogether, since it's only ever needed for this informational log
Yeah, it might be safer not to assume the content/structure of the data, since the MDS really allows for any shape. I think you're right that project_number is most common. Maybe best to just set up that log line so it won't error/crash if that field is missing.
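A sketch of that defensive log line, using .get() with a placeholder so a missing project_number doesn't raise a KeyError (names taken from the snippet above):

for study in study_metadata:
    if not study.get("__manifest"):
        # Fall back to a placeholder instead of crashing when the field is absent.
        study_label = study.get("project_number", "<unknown project_number>")
        print(f"Study {study_label} is missing __manifest entry. Skipping.")
        continue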
# print(f"Retrieved metadata: {study_metadata}") | ||
manifest = [] | ||
for study in study_metadata: | ||
if not study["__manifest"]: |
Do we make it configurable which field name we use? Is it __manifest for some and __files for others? Not saying it's good if it's configurable, but just checking. I wonder if we need to query the gitops info to see what the field name is.
For HEAL, I think it's always been __manifest: https://github.com/uc-cdis/cdis-manifest/blob/369b80452adb4a91fdef7fd1b6f2e283c12168e6/healdata.org/portal/gitops.json#L126
At least that's how we've been building manifests from the UI.
But I can change this to be passed as an env var from Sower; not sure if there's a way to avoid that being a duplicate configuration.
You could put some bash script in your cloud-auto PR to extract that from the portal config, put it into a k8s configmap or secret, and mount that.
Edit: oops, didn't see you just merged that cloud-auto PR.
If I search across our repos, I only see us ever setting manifestFieldName to __manifest. I think the best thing to do would be to not have that field configurable, keep it hard-coded here, and take it out of configurations. I think making that field configurable just adds more headache, especially if it's consistent everywhere.
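A sketch of that approach: a module-level constant instead of a configurable field name (the constant and helper names are illustrative, not from this PR):

MANIFEST_FIELD_NAME = "__manifest"

def get_file_entries(study):
    # Return the study's file records, or an empty list if the field is absent/empty.
    return study.get(MANIFEST_FIELD_NAME) or []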
Yeah, I remember mentioning having a set of "pre-defined" field name rules for MDS data that serve as a convention enforced across all MDS instances. That way we can reduce this kind of configuration complexity by a lot.
    aws_secret_access_key=aws_secret_access_key,
)

export_key = f"{username}-export.zip"
Is there a risk/possibility that an IDP may allow a user to be authenticated, but no username is set? Or could the username have special characters that don't play well with S3 or various file systems?
These are both good questions, but I'm not sure what fields an IDP requires or what characters they can take on. Can you think of a more appropriate way to name exports? I'm brainstorming...
A guid might be a nice way to simply make file names unique. That also allows us to not have to worry about storing names/identifying information in S3 or anywhere else.
A timestamp + guid is also nice because it allows us to see how old exports are and expire them while enforcing uniqueness.
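A sketch of such a key, combining a UTC timestamp with a GUID so exports are unique, carry no identifying information, and can be expired by age (function name is illustrative):

import uuid
from datetime import datetime, timezone

def make_export_key():
    # e.g. "20240115T093045Z-1c9a...-export.zip"
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{timestamp}-{uuid.uuid4()}-export.zip"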
Works well, good job ~
Just a couple minor notes, but lgtm!
7b631b5
New Features
{ "study_ids": [ "acb-xyz" ] }
and constructs a manifest of the files in these studies, downloads and compresses them, and uploads them to a user-specific location in s3 (which is overwritten each time a user initiates a download)
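A high-level sketch of that flow; the metadata endpoint path, the __manifest field, and the download_file helper are assumptions for illustration, not the exact implementation in this PR:

import io
import zipfile

import boto3
import requests

def export_studies(hostname, study_ids, bucket, export_key, download_file):
    # Gather file records from each study's metadata (assumed /mds/metadata/<id> endpoint).
    manifest = []
    for study_id in study_ids:
        resp = requests.get(f"https://{hostname}/mds/metadata/{study_id}")
        resp.raise_for_status()
        manifest.extend(resp.json().get("__manifest") or [])

    # Download each file into an in-memory zip archive.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for entry in manifest:
            filename, content = download_file(entry)  # hypothetical helper
            archive.writestr(filename, content)
    buffer.seek(0)

    # Overwrite the user-specific export object each time.
    boto3.client("s3").upload_fileobj(buffer, bucket, export_key)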