Allow access to S3 versioned datasets #196

jameshadfield · 2020-06-25T03:45:02Z

This issue expands on a comment by @joverlee521 in #192 (comment):

Would it help to take advantage of the versioning enabled in the nextstrain-data bucket? I believe you can retrieve specific versions of an object from S3

This would be interesting for nextstrain.org to explore as an alternative way to access "old" datasets instead of making separate datestamped files. Note that nextstrain-data already has versioning enabled. The best (and first) use case would be the seasonal flu builds, especially when combined with the search functionality of #192.

It's unclear whether this approach would be better than simply keeping datestamped files on the bucket, but if we want to pursue this we could implement a complete solution within the nextstrain.org server with no auspice modifications necessary.

Example

The h1n1-ha-3y dataset from 2020-01-01 could be accessed via https://nextstrain-data.s3.amazonaws.com/flu_seasonal_h1n1pdm_ha_3y.json?versionId=1HEJjqgcCardojwksJF1wIyDyGD3HE_s if the object were public, or we can do it server-side (with credentials) via:

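// Assumes the AWS SDK for JavaScript v2, with credentials available in the environment
const AWS = require("aws-sdk");
const S3 = new AWS.S3();

// Fetch one specific version of the object rather than the latest
S3.getObject({
  Bucket: "nextstrain-data",
  Key: "flu_seasonal_h1n1pdm_ha_3y.json",
  VersionId: "1HEJjqgcCardojwksJF1wIyDyGD3HE_s"
}).promise()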

We already have similar syntax for accessing datasets on a particular GitHub branch, e.g. /community/jameshadfield/scratch@test-branch/placentalia. Leveraging it, we could process URLs such as /flu/seasonal/h1n1pdm/ha/3y@1HEJjqgcCardojwksJF1wIyDyGD3HE_s, or go one step further and have the server keep track of the upload date of each version to allow URLs such as /flu/seasonal/h1n1pdm/ha/3y@2020-01-01. Since the nextstrain.org server essentially acts as a middleman between S3 and the client, it can interpret such a getDataset API call and return the correct version of the object (file). Some CloudFront modifications may be needed for this to work.
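To make the URL idea concrete, here is a minimal sketch of how the server might split such a path into an S3 key and a version specifier. The function name `parseDatasetPath` is hypothetical, and the path-to-key mapping (slashes become underscores, plus a `.json` suffix) is only inferred from the example above, not taken from the nextstrain.org codebase. It handles a trailing `@...` only, not the mid-path form used for GitHub branches.

```javascript
// Hypothetical helper: split a nextstrain.org dataset path into an S3 key
// and an optional version specifier (the text after "@").
function parseDatasetPath(urlPath) {
  const at = urlPath.indexOf("@");
  const datasetPath = at === -1 ? urlPath : urlPath.slice(0, at);
  // An empty specifier ("/ebola@") is treated the same as no specifier.
  const versionSpec = at === -1 ? null : urlPath.slice(at + 1) || null;

  // Assumed mapping, inferred from the example above:
  // /flu/seasonal/h1n1pdm/ha/3y -> flu_seasonal_h1n1pdm_ha_3y.json
  const key = datasetPath.split("/").filter(Boolean).join("_") + ".json";

  // A YYYY-MM-DD specifier is treated as a date; anything else is passed
  // through as an opaque S3 version id.
  const isDate = versionSpec !== null && /^\d{4}-\d{2}-\d{2}$/.test(versionSpec);
  return {
    key,
    versionId: versionSpec !== null && !isDate ? versionSpec : null,
    date: isDate ? versionSpec : null,
  };
}
```

The `key` and `versionId` values could then be passed as `Key` and `VersionId` to an S3 `getObject` call, while a `date` would first need resolving to a version id on the server.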

In conjunction with this, we could add a page listing previous versions, similar to the one we currently have for SARS-CoV-2 situation reports. Alternatively, we could dynamically modify the getAvailable API response so that previous versions appear in auspice's dataset dropdown menu.


tsibley · 2020-06-26T19:38:32Z

Glad to see this getting discussed again! I do think it would be useful for certain kinds of builds like Flu and SARS-CoV-2.

When I was implementing the @<branch> syntax, I was envisioning extending it to accept other kinds of revision specifiers, like dates that map to S3 object versions. (The syntax supported by git rev-parse is partly an inspiration here.) Implementation-wise, I don't think we need to store an explicit map of dates → S3 object versions; instead we can interpret @<date> as "the latest (or earliest?) version as of <date>". <date> can be YYYY-MM-DD to start, but could also accept, to varying degrees, relative specifications like yesterday or even ISO 8601 duration syntax like P6M for "6 months ago". We could also support things like @1, @2, … to mean 1 version ago, 2 versions ago, etc.

Relative specifiers like that (whether dates or numbers or both) enable possibly interesting things, like a stable URL for comparing changes to a dataset over time with tangletrees, e.g. https://nextstrain.org/ncov/global:/ncov/global@1 for comparing the latest ncov build with the previous one, or https://nextstrain.org/flu/seasonal/h3n2/ha/2y:/flu/seasonal/h3n2/ha/2y@P6M for comparing seasonal flu between now and 6 months ago.
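As a rough illustration of the "latest version as of <date>" interpretation, the sketch below resolves a specifier against a list of versions without storing any date map. The function name `resolveVersion` is hypothetical; the input shape (`VersionId`, `LastModified`) is assumed to mirror the `Versions` array from S3's ListObjectVersions. It picks "latest" rather than "earliest", covers only whole-month durations such as P6M, and treats a date as the end of that UTC day; all of these are assumptions, not decisions made in this thread.

```javascript
// Hypothetical resolver for the revision specifiers discussed above.
// `versions` is assumed to look like the Versions array returned by S3's
// ListObjectVersions: [{ VersionId, LastModified (Date) }, ...].
function resolveVersion(versions, spec, now = new Date()) {
  // Newest first, regardless of the order the listing arrived in.
  const sorted = [...versions].sort((a, b) => b.LastModified - a.LastModified);

  // "@1", "@2", ...: N versions before the latest.
  if (/^\d+$/.test(spec)) {
    return sorted[Number(spec)] ?? null;
  }

  let cutoff;
  const months = /^P(\d+)M$/.exec(spec);
  if (months) {
    // "@P6M": the state of the dataset N months before `now`.
    cutoff = new Date(now);
    cutoff.setUTCMonth(cutoff.getUTCMonth() - Number(months[1]));
  } else if (/^\d{4}-\d{2}-\d{2}$/.test(spec)) {
    // "@2020-01-01": include everything uploaded up to the end of that UTC day.
    cutoff = new Date(`${spec}T23:59:59.999Z`);
  } else {
    return null; // unrecognised specifier
  }

  // Latest version uploaded at or before the cutoff.
  return sorted.find((v) => v.LastModified <= cutoff) ?? null;
}
```

Because relative specifiers are resolved at request time, a URL ending in @1 or @P6M stays stable while the version it points to moves as new builds are uploaded.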

jameshadfield · 2023-07-19T23:57:12Z

@trs would you be comfortable with us implementing @versionId to start off with?

e.g. nextstrain.org/ebola@Y_5o_1ij5yMhX23opio_GIC8KiHYytcI <-> https://nextstrain-data.s3.amazonaws.com/ebola.json?versionId=Y_5o_1ij5yMhX23opio_GIC8KiHYytcI

tsibley · 2023-07-21T00:10:11Z

Off the cuff, I don't think ideally we'd expose S3 version ids at all. It'd be a large encapsulation leak and make it harder to implement something better in the future. If we really need to expose them for expediency (and I don't think we do…?), I'd want to prefix them with a namespace to avoid issues with other forms in the future, e.g. ebola@s3VersionId=Y_5o_1ij5yMhX23opio_GIC8KiHYytcI or similar.

I don't see us leaving S3 here, so am not worried about consequences of that re: exposure of version ids, but I do think the encapsulation break is likely to come back to bite us in other aspects of nextstrain.org.
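A namespaced specifier of the kind suggested above could be separated from other forms with something as small as the following sketch. The function name `parseRevision` and the returned shape are hypothetical; only the `s3VersionId=` prefix comes from the example in this comment.

```javascript
// Hypothetical parser: split "<namespace>=<value>" revision specifiers so
// that raw S3 version ids never share a syntax with dates or counts.
function parseRevision(spec) {
  const eq = spec.indexOf("=");
  if (eq > 0) {
    // Split on the first "=" only, so the value itself may contain "=".
    return { namespace: spec.slice(0, eq), value: spec.slice(eq + 1) };
  }
  // No namespace: left for higher-level forms such as dates.
  return { namespace: null, value: spec };
}
```

With this split, ebola@s3VersionId=... would be routed to S3 explicitly, and un-namespaced forms stay free for whatever is implemented later.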

jameshadfield · 2024-04-10T04:31:31Z

Closed by #719 (improvements such as the @6M syntax are possible, but since we have a UI to surface the past versions I don't think they're a priority).

jameshadfield mentioned this issue Jun 25, 2020

Generalise strain search functionality #192

Merged

eharkins mentioned this issue Nov 19, 2020

Update sars-cov-2 search regularly #230

Merged

eharkins mentioned this issue Dec 1, 2020

collect-search-results.js only fetch updated datasets #233

Open

jameshadfield mentioned this issue Jan 24, 2021

Allow data versioning nextstrain/auspice#31

Closed

nextstrain-bot added this to Nextstrain planning (archived) Jul 20, 2023

github-project-automation bot moved this to New in Nextstrain planning (archived) Jul 20, 2023

jameshadfield closed this as completed Apr 10, 2024
