Batch export sower job #29
Conversation
describe_access_to_files_in_workspace_manifest(
    hostname, auth, MANIFEST_FILENAME
)
download_files_in_workspace_manifest(
FYI, just making a note here (since the DRS download Python SDK branch doesn't have a PR right now): I think the way the SDK currently implements this is to gather the access tokens from all related commons first, and then iterate through the manifest to download the files.
There is a possibility that, if the manifest is large or some downloads take longer than 20 minutes for whatever reason, some access tokens will expire, resulting in failures in subsequent downloads.
Probably not a big deal for now because we have relatively small file size limits.
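One way to avoid that expiry problem would be to fetch or refresh tokens lazily per download rather than gathering them all up front. A minimal sketch, assuming hypothetical get_wts_token and download_drs_object helpers (stand-ins, not the SDK's actual API):

import time

def download_with_refresh(manifest, output_dir, get_wts_token, download_drs_object):
    """Download each manifest entry, refreshing the per-commons access token
    near expiry instead of pre-caching every token up front.

    get_wts_token(commons) -> (token, expires_at_epoch_seconds)   # hypothetical
    download_drs_object(object_id, token, output_dir) -> None     # hypothetical
    """
    token_cache = {}  # commons hostname -> (token, expires_at)
    for entry in manifest:
        commons = entry["commons_url"]
        token, expires_at = token_cache.get(commons, (None, 0))
        # Refresh if we have no token for this commons or it expires within 60s.
        if token is None or expires_at - time.time() < 60:
            token, expires_at = get_wts_token(commons)
            token_cache[commons] = (token, expires_at)
        download_drs_object(entry["object_id"], token, output_dir)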
Yes, the pre-caching of WTS tokens is a known issue and needs to be addressed. I will highlight this in the related feature doc.
Note: this job replaces the (previously assumed) need for a separate file downloads API for the sake of HEAL. However, Sower's limitation on input data requires that the input fit in an environment variable, hence we send a list of study IDs rather than an entire manifest. If a manifest, DRS bundle, etc. is preferable for another use case, this job is easily extensible to use a user-specific S3 file as a job input instead. Batch-export cloud automation is set up with a bucket for this case.
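For reference, a sketch of how the job might read that input, assuming Sower exposes the request body to the container via an environment variable (INPUT_DATA here is an assumed name, not confirmed by this PR):

import json
import os

def get_study_ids():
    # INPUT_DATA is an assumed variable name for the JSON payload Sower passes in.
    payload = json.loads(os.environ.get("INPUT_DATA", "{}"))
    study_ids = payload.get("study_ids", [])
    if not study_ids:
        raise ValueError("No study_ids provided in job input")
    return study_ids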
for study in study_metadata:
    if not study["__manifest"]:
        print(
            f"Study {study['project_number']} is missing __manifest entry. Skipping."
Is the key configurable? Might it be project_number for some entries, but study_id for others?
Might be worth removing this key altogether, since it's only ever needed for this informational log
Yeah, it might be safer not to assume the content/structure of the data, since the MDS really allows for any shape. I think you're right that project_number is most common. Maybe best to just set up that log line so it won't error/crash if that field is missing.
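A sketch of that defensive log line, using .get() with a placeholder so a missing project_number doesn't raise a KeyError (names taken from the snippet above):

for study in study_metadata:
    if not study.get("__manifest"):
        # Fall back to a placeholder instead of crashing when the field is absent.
        study_label = study.get("project_number", "<unknown project_number>")
        print(f"Study {study_label} is missing __manifest entry. Skipping.")
        continue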
# print(f"Retrieved metadata: {study_metadata}") | ||
manifest = [] | ||
for study in study_metadata: | ||
if not study["__manifest"]: |
Do we make it configurable which field name we use? Is it __manifest for some and __files for others? Not saying it's good if it's configurable, but just checking. I wonder if we need to query the gitops info to see what the field name is.
For HEAL, I think it's always been __manifest: https://github.com/uc-cdis/cdis-manifest/blob/369b80452adb4a91fdef7fd1b6f2e283c12168e6/healdata.org/portal/gitops.json#L126
At least that's how we've been building manifests from the UI.
But I can change this to be passed as an env var from Sower; not sure if there's a way to avoid that being a duplicate configuration.
You could put some bash script in your cloud-auto PR to extract that from the portal config, put it into a k8s configmap or secret, and mount that.
Edit: oops, didn't see you just merged that cloud-auto PR.
If I search across our repos, I only see us ever setting manifestFieldName to __manifest. I think the best thing to do would be to not have that field configurable, keep it hard-coded here, and take it out of configurations. I think making that field configurable just adds more headache, especially if it's consistent everywhere.
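A sketch of that approach: a module-level constant instead of a configurable field name (the constant and helper names are illustrative, not from this PR):

MANIFEST_FIELD_NAME = "__manifest"

def get_file_entries(study):
    # Return the study's file records, or an empty list if the field is absent/empty.
    return study.get(MANIFEST_FIELD_NAME) or []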
Yeah, I remember mentioning having a set of "pre-defined" field name rules for MDS data that serve as a convention enforced across all MDS instances. That way we can reduce this kind of configuration complexity by a lot.
    aws_secret_access_key=aws_secret_access_key,
)

export_key = f"{username}-export.zip"
Is there a risk/possibility that an IDP may allow a user to be authenticated, but no username is set? Or could the username have special characters that don't play well with S3 or various file systems?
These are both good questions, but I'm not sure what fields an IDP requires or what characters they can take on. Can you think of a more appropriate way to name exports? I'm brainstorming...
A guid might be a nice way to simply make file names unique. That also allows us to not have to worry about storing names/identifying information in S3 or anywhere else.
A timestamp + guid is also nice because it allows us to see how old exports are and expire them while enforcing uniqueness.
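A sketch of such a key, combining a UTC timestamp with a GUID so exports are unique, carry no identifying information, and can be expired by age (function name is illustrative):

import uuid
from datetime import datetime, timezone

def make_export_key():
    # e.g. "20240115T093045Z-1c9a...-export.zip"
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{timestamp}-{uuid.uuid4()}-export.zip"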
Works well, good job ~
Just a couple minor notes, but lgtm!
7b631b5
New Features
{ "study_ids": [ "acb-xyz" ] }
and constructs a manifest of the files in these studies, downloads and compresses them, and uploads them to a user-specific location in s3 (which is overwritten each time a user initiates a download)
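A high-level sketch of that flow; the metadata endpoint path, the __manifest field, and the download_file helper are assumptions for illustration, not the exact implementation in this PR:

import io
import zipfile

import boto3
import requests

def export_studies(hostname, study_ids, bucket, export_key, download_file):
    # Gather file records from each study's metadata (assumed /mds/metadata/<id> endpoint).
    manifest = []
    for study_id in study_ids:
        resp = requests.get(f"https://{hostname}/mds/metadata/{study_id}")
        resp.raise_for_status()
        manifest.extend(resp.json().get("__manifest") or [])

    # Download each file into an in-memory zip archive.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for entry in manifest:
            filename, content = download_file(entry)  # hypothetical helper
            archive.writestr(filename, content)
    buffer.seek(0)

    # Overwrite the user-specific export object each time.
    boto3.client("s3").upload_fileobj(buffer, bucket, export_key)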