Update sars-cov-2 search regularly #230

Merged: 5 commits into master on Nov 30, 2020

Conversation

eharkins (Contributor)

This is a GitHub Actions workflow which runs scripts/collect-search-results.js and uploads the resulting data to S3 in order to update https://nextstrain.org/search/sars-cov-2. It is scheduled to run every weekday, as close as possible to the time by which we have usually published new sars-cov-2 (ncov) builds to S3; as I wrote in a comment in the workflow itself, the timing is imperfect for sars-cov-2 (ncov) builds because they are run in different time zones on alternating days. More jobs in this workflow, or separate workflows with separate schedules, can be added for other pathogens whose search pages we want to maintain. If we want to make this 'smarter', we could use something like AWS Lambda to run automatically whenever certain files in an S3 bucket are updated, but that doesn't seem like a big enough improvement over this approach to justify the time it would take (at least for me, given my lack of Lambda experience).
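
For readers unfamiliar with scheduled Actions, a minimal sketch of the general shape such a workflow takes is below; the filename, cron expression, bucket, and step details are placeholders, not the exact contents of the workflow added in this PR.

```yaml
# .github/workflows/update-search.yml -- hypothetical name and values, for illustration only
name: Update sars-cov-2 search results

on:
  schedule:
    # Weekdays only; the hour is a placeholder chosen to fall after the usual
    # ncov build upload time (GitHub Actions cron times are in UTC).
    - cron: '0 18 * * 1-5'

jobs:
  update-sars-cov-2-search:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
      - name: Collect search results and upload to S3
        run: |
          npm ci
          # Arguments omitted; see scripts/collect-search-results.js for its real interface.
          node scripts/collect-search-results.js
          # Placeholder bucket/key for the upload destination.
          aws s3 cp search-results.json s3://<bucket>/<key>
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```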

Description of proposed changes


Related issue(s)

Slack thread
#187
#192
#196
#182

Testing

I will trigger this manually now to test that it works, and also monitor it to see that the cron schedule behaves as desired. (I think the schedule will apply even though it's on a branch, since I didn't specify which branch it runs on, but I need to check what the default is there.)

Thank you for contributing to Nextstrain!

This is a GitHub Actions workflow which runs scripts/collect-search-results.js every weekday, as close as possible to the time by which we have usually published new sars-cov-2 (ncov) builds to S3. More jobs in this workflow, or separate workflows with separate schedules, can be added for other pathogens whose search pages we want to maintain. If we want to make this 'smarter' we could use something like AWS Lambda to run automatically whenever certain files in an S3 bucket are updated.

emmahodcroft (Member) commented Nov 19, 2020

I'm not really qualified to review the code for this, but the functionality sounds great. This seems like a good solution to me: it's (relatively) easy to get up and running, and it will go a long way toward keeping that search function up to date!

You need to explicitly carry changes to the environment from one step to the next within a GitHub Actions job, by using the PATH variable if I remember correctly. I'm avoiding dealing with this for now by sticking it all in the same step: if we want to update other search databases in addition to sars-cov-2, they'll either be on the same schedule and can be tacked onto this step, or on a separate schedule and will therefore need a separate workflow. Unless we want some updates within a workflow to be able to fail without preventing the rest (which seems smart).
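
For context, a minimal sketch (not part of this PR's workflow) of how environment changes can be carried across steps if that ever becomes necessary; the installer and tool names are hypothetical:

```yaml
# Hypothetical example: a shell `export` in one step does not affect later steps.
# GitHub Actions instead provides special files for the things that do persist.
steps:
  - name: Install a tool into a custom directory
    run: |
      ./install-tool.sh --prefix "$HOME/tool"        # placeholder installer
      echo "$HOME/tool/bin" >> "$GITHUB_PATH"        # adds to PATH for subsequent steps
      echo "TOOL_HOME=$HOME/tool" >> "$GITHUB_ENV"   # exports an env var for subsequent steps
  - name: Use it in a later step
    run: tool --version   # works because PATH was extended via GITHUB_PATH above
```
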
eharkins (Contributor, Author)

This basically works now. Still todo:

- allow triggering manually from the command line using a copy of https://github.com/nextstrain/ncov-ingest/blob/master/bin/trigger (or maybe we can share this across the organization if we think it can be generalized)
- test cron timing

These only work on the default branch of a repository, so we won't be able to test them until we merge.

How often would we like to update https://nextstrain.org/search/seasonal-flu, and are there other search pages we have, or would like to have, that we should include here?

Once we get those added, I think it's OK to merge and then begin testing the different ways of running this workflow mentioned above.

This removes two ways of triggering the workflow:
1. On push (this was for testing purposes).
2. On repository_dispatch; this only works on master and so couldn't be tested on this branch. It could also be seen as redundant with just running the update-search script locally (although that assumes you have the dependencies installed).
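
For illustration, the remaining trigger section presumably reduces to just the schedule, along the lines of the sketch below; the cron expression is a placeholder, and workflow_dispatch is shown only as one possible way to restore a manual trigger later:

```yaml
on:
  schedule:
    - cron: '0 18 * * 1-5'   # placeholder; weekdays, timed around the ncov build uploads
  # workflow_dispatch: {}    # optional: would allow manual runs from the Actions tab or API
```
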
jameshadfield (Member) left a comment

Looks good Eli! Thanks for making those recent changes. Let's merge at a time when we can monitor that the schedule works as expected.

eharkins merged commit d6a64f5 into master on Nov 30, 2020

eharkins (Contributor, Author) commented Dec 1, 2020

Looks like the schedule is working now.

I let Jover know this exists in case we want to use it for seasonal-flu.

I also realized the only difference between

> allow triggering manually from the command line

and running this locally as we have done in the past would be that it happens on GitHub Actions, which has pros and cons. I would imagine

> Fetching dataset script improvements (separate PR)

is a bigger priority; @jameshadfield, did you have specific improvements in mind, or did you just want to generally fetch things more quickly/efficiently?

jameshadfield (Member)

> specific improvements in mind or did you just want to generally fetch things more quickly/efficiently

The script as written was almost a proof-of-principle approach. Each time it runs it downloads every JSON; a better approach would only download new/changed JSONs and store the results somewhere for access next time. Not a huge issue right now as S3 access is cheap, but something to consider in the future as needed. I would place other things (SARS front-page UI, etc.) as a higher priority.
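
Purely as a sketch of that idea (not something in this PR), one option would be to cache a local mirror of the dataset JSONs between workflow runs and let `aws s3 sync` fetch only new or changed objects; the bucket, prefix, and cache key below are placeholders:

```yaml
steps:
  - uses: actions/cache@v2
    with:
      path: datasets/                              # local mirror of the dataset JSONs (placeholder path)
      key: search-datasets-${{ github.run_id }}    # always saves a fresh cache...
      restore-keys: search-datasets-               # ...while restoring the most recent one
  - name: Fetch only new or changed JSONs
    run: aws s3 sync s3://<bucket>/<prefix> datasets/ --exclude '*' --include '*.json'
```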
