Allow access to S3 versioned datasets #196

Closed
jameshadfield opened this issue Jun 25, 2020 · 4 comments

@jameshadfield
Member

This issue expands on a comment by @joverlee521 in #192 (comment):

Would it help to take advantage of the versioning enabled in the nextstrain-data bucket? I believe you can retrieve specific versions of an object from S3

This would be interesting for nextstrain.org to explore as an alternative way to access "old" datasets instead of making separate datestamped files. Note that nextstrain-data is already using versioning. The best / first use case for this would be the seasonal flu builds, especially when combined with the search functionality of #192.

It's unclear whether this approach would be better than simply keeping datestamped files on the bucket, but if we want to pursue this we could implement a complete solution within the nextstrain.org server with no auspice modifications necessary.

Example

The h1n1-ha-3y dataset from 2020-01-01 could be accessed via https://nextstrain-data.s3.amazonaws.com/flu_seasonal_h1n1pdm_ha_3y.json?versionId=1HEJjqgcCardojwksJF1wIyDyGD3HE_s were the object to be public, or we can do it server side (with credentials) via

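// Assumes the AWS SDK for JavaScript v2 and credentials with read access to the bucket
const AWS = require("aws-sdk");
const S3 = new AWS.S3();

S3.getObject({
  Bucket: "nextstrain-data",
  Key: "flu_seasonal_h1n1pdm_ha_3y.json",
  VersionId: "1HEJjqgcCardojwksJF1wIyDyGD3HE_s"
}).promise()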

Leveraging syntax similar to what we already have for accessing datasets on a particular GitHub branch (e.g. /community/jameshadfield/scratch@test-branch/placentalia), we could process URLs such as /flu/seasonal/h1n1pdm/ha/3y@1HEJjqgcCardojwksJF1wIyDyGD3HE_s, or go one step further and have the server keep track of the upload date of each version to allow URLs such as /flu/seasonal/h1n1pdm/ha/3y@2020-01-01. Since the nextstrain.org server essentially acts as a middleman between S3 and the client, it can interpret such a getDataset API call and return the correct version of the object (file). There may be some CloudFront modifications needed for this to work.
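As a rough illustration of the URL handling described above, the sketch below splits a request path into a dataset path and a trailing version specifier. The function name `parseVersionedPath` is hypothetical, and the key convention (path segments joined by underscores plus `.json`) is only inferred from the h1n1pdm example; it is not taken from the nextstrain.org server code. It also only handles a specifier at the end of the path, not the mid-path form used for community branches.

```javascript
// Hypothetical helper (not part of nextstrain.org): split a request path such as
// "/flu/seasonal/h1n1pdm/ha/3y@2020-01-01" into a dataset path and a version specifier.
function parseVersionedPath(urlPath) {
  const at = urlPath.indexOf("@");
  const datasetPath = at === -1 ? urlPath : urlPath.slice(0, at);
  // Everything after the first "@" is treated as the specifier; empty means "none".
  const version = at === -1 ? null : urlPath.slice(at + 1) || null;

  // Assumed key convention, inferred from the example in this issue:
  // path segments joined by "_" with a ".json" suffix.
  const key = datasetPath.split("/").filter(Boolean).join("_") + ".json";

  return { datasetPath, key, version };
}
```

The returned `key` and `version` could then be passed to a server-side S3 lookup, with the specifier resolved to a concrete version ID first if it is a date.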

Alongside this, we would want a page listing previous versions, similar to the one we currently have for SARS-CoV-2 situation reports. Alternatively, we could dynamically modify the getAvailable API response so that previous versions appear in auspice's dataset dropdown menu.

@tsibley
Member

tsibley commented Jun 26, 2020

Glad to see this getting discussed again! I do think it would be useful for certain kinds of builds like Flu and SARS-CoV-2.

When I was implementing the @<branch> syntax, I was envisioning extending it to accept other kinds of revision specifiers, like dates that map to S3 object versions. (The syntax supported by git rev-parse is partly an inspiration here.) Implementation-wise, I don't think we need to store an explicit map of dates → S3 object version; we could instead interpret @<date> as "the latest (or earliest?) version as of <date>". <date> can be YYYY-MM-DD to start, but could also accept (to varying degrees) relative specifications like yesterday or even ISO 8601 duration syntax like P6M for "6 months ago". We could also support things like @1, @2, … to mean 1 version ago, 2 versions ago, etc. Relative specifiers like that (whether dates or numbers or both) enable possibly interesting things like a stable URL for comparing changes to a dataset over time with tangletrees, e.g. https://nextstrain.org/ncov/global:/ncov/global@1 for comparing the difference between two ncov builds or https://nextstrain.org/flu/seasonal/h3n2/ha/2y:/flu/seasonal/h3n2/ha/2y@P6M for comparing seasonal flu between now and 6 months ago.
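A minimal sketch of how such specifiers might be resolved without a stored date map, assuming the server can obtain the list of versions for a key (for example from S3's ListObjectVersions, whose entries carry `VersionId` and `LastModified`). The function name `resolveVersion` is hypothetical. It picks the "latest version as of <date>" reading, treating the date as the end of that UTC day, and only understands YYYY-MM-DD, whole-month durations like P6M, and plain integers; all of these are choices made for illustration rather than settled design.

```javascript
// Hypothetical resolver. `versions` is assumed to be an array of
// { VersionId: string, LastModified: Date } entries for a single S3 key.
function resolveVersion(versions, specifier, now = new Date()) {
  // Sort newest first without mutating the caller's array.
  const sorted = [...versions].sort((a, b) => b.LastModified - a.LastModified);

  // @1, @2, ...: N versions before the latest (0 is the latest itself).
  if (/^\d+$/.test(specifier)) {
    return sorted[Number(specifier)] || null;
  }

  let cutoff;
  const date = /^(\d{4})-(\d{2})-(\d{2})$/.exec(specifier);
  const months = /^P(\d+)M$/.exec(specifier);
  if (date) {
    // End of the given UTC day, so "as of <date>" includes uploads made that day.
    cutoff = new Date(Date.UTC(+date[1], +date[2] - 1, +date[3], 23, 59, 59, 999));
  } else if (months) {
    // N months before `now`. Day-of-month overflow (e.g. the 31st) rolls
    // forward, per JavaScript Date semantics.
    cutoff = new Date(now);
    cutoff.setUTCMonth(cutoff.getUTCMonth() - Number(months[1]));
  } else {
    return null; // unrecognised specifier
  }

  // Latest version uploaded at or before the cutoff, if any.
  return sorted.find((v) => v.LastModified <= cutoff) || null;
}
```

Passing `now` as a parameter keeps the relative forms testable; a request handler would normally leave it at the default.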

@jameshadfield
Member Author

@trs would you be comfortable with us implementing @versionId to start off with?

e.g. nextstrain.org/ebola@Y_5o_1ij5yMhX23opio_GIC8KiHYytcI <-> https://nextstrain-data.s3.amazonaws.com/ebola.json?versionId=Y_5o_1ij5yMhX23opio_GIC8KiHYytcI
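The correspondence in this example could be sketched as below. The helper name `s3UrlFor` is hypothetical, and the path-to-key convention (segments joined by underscores plus `.json`) is an assumption based on the examples in this thread. The resulting URL would only be fetchable directly if the object version were public.

```javascript
// Hypothetical helper: build the versioned S3 URL for a nextstrain.org dataset path.
function s3UrlFor(datasetPath, versionId) {
  // Assumed key convention: "/flu/seasonal/..." becomes "flu_seasonal_....json".
  const key = datasetPath.split("/").filter(Boolean).join("_") + ".json";
  const url = new URL(`https://nextstrain-data.s3.amazonaws.com/${key}`);
  // Only add the query parameter when a specific version was requested.
  if (versionId) url.searchParams.set("versionId", versionId);
  return url.toString();
}
```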

@tsibley
Member

tsibley commented Jul 21, 2023

Off the cuff, I'd ideally not expose S3 version ids at all. It'd be a large encapsulation leak and make it harder to implement something better in the future. If we really need to expose them for expediency (and I don't think we do…?), I'd want to prefix them with a namespace to avoid issues with other forms in the future, e.g. ebola@s3VersionId=Y_5o_1ij5yMhX23opio_GIC8KiHYytcI or similar.

I don't see us leaving S3 here, so am not worried about consequences of that re: exposure of version ids, but I do think the encapsulation break is likely to come back to bite us in other aspects of nextstrain.org.
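To illustrate the namespacing idea, here is a small sketch that separates an optional `name=` prefix from the rest of a specifier. The function name `parseSpecifier` and the returned shape are hypothetical; only the `s3VersionId=` form comes from the comment above.

```javascript
// Hypothetical parser for revision specifiers with an optional namespace prefix,
// e.g. "s3VersionId=Y_5o_..." versus a bare "2020-01-01".
function parseSpecifier(specifier) {
  const eq = specifier.indexOf("=");
  if (eq === -1) {
    // No namespace: reserved for friendlier forms such as dates or relative versions.
    return { namespace: null, value: specifier };
  }
  // Split on the first "=" only, so the value may itself contain "=".
  return { namespace: specifier.slice(0, eq), value: specifier.slice(eq + 1) };
}
```

Keeping raw version ids behind an explicit namespace leaves the bare form free for whatever specifier syntax is adopted later.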

@jameshadfield
Member Author

Closed by #719 (improvements, such as @6M syntax, are possible, but since we have a UI to surface the past versions I don't think they're a priority)
