Update sars-cov-2 search regularly #230

Merged: 5 commits into master on Nov 30, 2020

Conversation

eharkins (Contributor)

This is a GitHub Actions workflow which runs scripts/collect-search-results.js and uploads the resulting data to S3 in order to update https://nextstrain.org/search/sars-cov-2. It is scheduled to run every weekday, as close as possible to the time by which we have usually published new sars-cov-2 (ncov) builds to S3; as I wrote in a comment in the workflow itself, the timing is imperfect for sars-cov-2 (ncov) builds because they are run in different time zones on alternating days. More jobs in this workflow, or separate workflows with separate schedules, can be added for other pathogens whose search pages we want to maintain. If we want to make this 'smarter', we could use something like AWS Lambda to run automatically whenever certain files in an S3 bucket are updated, but that doesn't seem like a big enough improvement over this approach to justify the time it would take (at least for me, given my lack of Lambda experience).
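
For readers unfamiliar with scheduled Actions, a minimal sketch of the general shape such a workflow takes is below; the filename, cron expression, bucket, and step details are placeholders, not the exact contents of the workflow added in this PR.

```yaml
# .github/workflows/update-search.yml -- hypothetical name and values, for illustration only
name: Update sars-cov-2 search results

on:
  schedule:
    # Weekdays only; the hour is a placeholder chosen to fall after the usual
    # ncov build upload time (GitHub Actions cron times are in UTC).
    - cron: '0 18 * * 1-5'

jobs:
  update-sars-cov-2-search:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
      - name: Collect search results and upload to S3
        run: |
          npm ci
          # Arguments omitted; see scripts/collect-search-results.js for its real interface.
          node scripts/collect-search-results.js
          # Placeholder bucket/key for the upload destination.
          aws s3 cp search-results.json s3://<bucket>/<key>
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```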

Description of proposed changes


Related issue(s)

Slack thread
#187
#192
#196
#182

Testing

I will trigger this manually now to test that it works, and also monitor it to see that the cron schedule behaves as desired. (I think the schedule will apply even though it's on a branch, since I didn't specify which branch it runs on, but I need to check what the default is there.)

Thank you for contributing to Nextstrain!

This is a GitHub Actions workflow which runs scripts/collect-search-results.js every weekday, as close as possible to the time by which we have usually published new sars-cov-2 (ncov) builds to S3. More jobs in this workflow, or separate workflows with separate schedules, can be added for other pathogens whose search pages we want to maintain. If we want to make this 'smarter' we could use something like AWS Lambda to run automatically whenever certain files in an S3 bucket are updated.

emmahodcroft (Member) commented Nov 19, 2020

I'm not really qualified to review the code for this, but the functionality sounds great. This seems like a good solution to me: it's (relatively) easy to get up and running, and it will go a long way toward keeping that search function up to date!

You need to explicitly carry changes to the environment from one step to the next within a GitHub Actions job, by using the PATH variable if I remember correctly. I'm avoiding dealing with this for now by sticking it all in the same step: if we want to update other search databases in addition to sars-cov-2, they'll either be on the same schedule and can be tacked onto this step, or on a separate schedule and will therefore need a separate workflow. Unless we want some updates within a workflow to be able to fail without preventing the rest (which seems smart).
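
For context, a minimal sketch (not part of this PR's workflow) of how environment changes can be carried across steps if that ever becomes necessary; the installer and tool names are hypothetical:

```yaml
# Hypothetical example: a shell `export` in one step does not affect later steps.
# GitHub Actions instead provides special files for the things that do persist.
steps:
  - name: Install a tool into a custom directory
    run: |
      ./install-tool.sh --prefix "$HOME/tool"        # placeholder installer
      echo "$HOME/tool/bin" >> "$GITHUB_PATH"        # adds to PATH for subsequent steps
      echo "TOOL_HOME=$HOME/tool" >> "$GITHUB_ENV"   # exports an env var for subsequent steps
  - name: Use it in a later step
    run: tool --version   # works because PATH was extended via GITHUB_PATH above
```
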
eharkins (Contributor, Author)

This basically works now. Still todo:

- allow triggering manually from the command line using a copy of https://github.com/nextstrain/ncov-ingest/blob/master/bin/trigger (or maybe we can share this across the organization if we think it can be generalized)
- test cron timing

These only work on the default branch of a repository, so we won't be able to test them until we merge.

How often would we like to update https://nextstrain.org/search/seasonal-flu, and are there other search pages we have, or would like to have, that we should include here?

Once we get those added, I think it's OK to merge and then begin testing the different ways of running this workflow mentioned above.

This removes two ways of triggering the workflow:
1. On push (this was for testing purposes).
2. On repository_dispatch; this only works on master and so couldn't be tested on this branch. It could also be seen as redundant with just running the update-search script locally (although that assumes you have the dependencies installed).
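
For illustration, the remaining trigger section presumably reduces to just the schedule, along the lines of the sketch below; the cron expression is a placeholder, and workflow_dispatch is shown only as one possible way to restore a manual trigger later:

```yaml
on:
  schedule:
    - cron: '0 18 * * 1-5'   # placeholder; weekdays, timed around the ncov build uploads
  # workflow_dispatch: {}    # optional: would allow manual runs from the Actions tab or API
```
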
jameshadfield (Member) left a comment

Looks good Eli! Thanks for making those recent changes. Let's merge at a time when we can monitor that the schedule works as expected.

eharkins merged commit d6a64f5 into master on Nov 30, 2020

eharkins (Contributor, Author) commented Dec 1, 2020

Looks like the schedule is working now.

I let Jover know this exists in case we want to use it for seasonal-flu.

I also realized the only difference between

> allow triggering manually from the command line

and running this locally as we have done in the past would be that it happens on GitHub Actions, which has pros and cons. I would imagine

> Fetching dataset script improvements (separate PR)

is a bigger priority; @jameshadfield, did you have specific improvements in mind, or did you just want to generally fetch things more quickly/efficiently?

jameshadfield (Member)

> specific improvements in mind or did you just want to generally fetch things more quickly/efficiently

The script as written was almost a proof-of-principle approach. Each time it runs it downloads every JSON; a better approach would only download new/changed JSONs and store the results somewhere for access next time. Not a huge issue right now as S3 access is cheap, but something to consider in the future as needed. I would place other things (SARS front-page UI, etc.) as a higher priority.
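
Purely as a sketch of that idea (not something in this PR), one option would be to cache a local mirror of the dataset JSONs between workflow runs and let `aws s3 sync` fetch only new or changed objects; the bucket, prefix, and cache key below are placeholders:

```yaml
steps:
  - uses: actions/cache@v2
    with:
      path: datasets/                              # local mirror of the dataset JSONs (placeholder path)
      key: search-datasets-${{ github.run_id }}    # always saves a fresh cache...
      restore-keys: search-datasets-               # ...while restoring the most recent one
  - name: Fetch only new or changed JSONs
    run: aws s3 sync s3://<bucket>/<prefix> datasets/ --exclude '*' --include '*.json'
```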
