Add resource collection docs
Includes documentation of the AWS changes which are not under terraform
control, as well as a general introduction to the concept.
jameshadfield committed Nov 2, 2023
1 parent d3d304c commit 255a329
Showing 2 changed files with 91 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -30,4 +30,5 @@ nextstrain.org
routing
infrastructure
terraform
resource-collection
glossary
90 changes: 90 additions & 0 deletions docs/resource-collection.rst
@@ -0,0 +1,90 @@
===================
Resource Collection
===================

For nextstrain.org to handle URLs with ``@YYYY-MM-DD`` identifiers, the
server needs to know which files exist, including past versions.
In the future this data will also be used to list and display all available
resources (and their versions) to the user.

The index is generated by a script, and the
resulting JSON file is loaded by the server at startup.
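
To make the role of the index concrete, here is a minimal sketch of how a
dated URL could be resolved against a list of known versions. The index layout
is not described in this document, so the ``versions`` array, its ``date``
field, and both function names are hypothetical, and the rule used (pick the
latest version dated on or before the requested date) is an assumption rather
than the server's confirmed behaviour.

```javascript
// Sketch only: field names and the resolution rule are assumptions.

/** Split a path such as "/zika@2023-01-15" into its path and date parts. */
function parseVersionedPath(urlPath) {
  const match = urlPath.match(/^(.*)@(\d{4}-\d{2}-\d{2})$/);
  return match
    ? { path: match[1], date: match[2] }
    : { path: urlPath, date: undefined };
}

/**
 * Return the latest version dated on or before `requestedDate`, or
 * undefined if no such version exists. Dates are YYYY-MM-DD strings,
 * which compare correctly as plain strings.
 */
function resolveVersion(versions, requestedDate) {
  let best;
  for (const version of versions) {
    if (version.date <= requestedDate && (!best || version.date > best.date)) {
      best = version;
    }
  }
  return best;
}
```

Without an index of past versions the server would have no list to search,
which is why the index must include more than the current files.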

Local development
=================

The index creation script can be run locally to produce a local JSON
file; see ``./resourceIndexer/main.js`` for more details.

To have the server use this file, set the environment variable
``LOCAL_RESOURCE_INDEX`` to the path of the (JSON) file.
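
As an illustration, reading a file named by ``LOCAL_RESOURCE_INDEX`` could
look like the sketch below. The function name ``loadLocalResourceIndex`` is
hypothetical, and the real server code may handle this differently (for
example, it may also accept compressed files).

```javascript
// Sketch only: not the server's actual loading code.
const fs = require("node:fs");

/**
 * Load the resource index from the path in LOCAL_RESOURCE_INDEX.
 * Returns undefined when the variable is unset, so the caller can
 * fall back to another source of the index.
 */
function loadLocalResourceIndex(env = process.env) {
  const indexPath = env.LOCAL_RESOURCE_INDEX;
  if (!indexPath) return undefined;
  // The file is plain JSON, as produced by the local indexer run.
  return JSON.parse(fs.readFileSync(indexPath, "utf8"));
}
```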


Automated index generation
==========================

*This section will be updated once the
index creation is automated.*

AWS settings necessary for resource collection
==============================================

Index creation, storage and retrieval require certain AWS settings, which
are documented here because most of them are not under terraform control. We use `S3
inventories
<https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html>`__,
which AWS generates daily, to list all the documents in certain buckets (or
bucket prefixes). The index creation script downloads these
inventories and uses them to create an index JSON, which it uploads to S3. The
nextstrain.org server reads this JSON from S3.

S3 inventories
--------------

We currently produce inventories for the core (s3://nextstrain-data) and
staging (s3://nextstrain-staging) buckets. The inventories are generated daily
and published to s3://nextstrain-inventories, which is a private bucket. The
inventory configuration can be found in the AWS console for
`core <https://s3.console.aws.amazon.com/s3/management/nextstrain-data/inventory/view?region=us-east-1&id=config-v1>`__
and
`staging <https://s3.console.aws.amazon.com/s3/management/nextstrain-staging/inventory/view?region=us-east-1&id=config-v1>`__.
The configuration specifies that the additional metadata fields for last
modified date and ETag are included in the inventory. The inventories for core
and staging are published to
s3://nextstrain-inventories/nextstrain-data/config-v1 and
s3://nextstrain-inventories/nextstrain-staging/config-v1, respectively.
The cost of these inventories is minimal (less than $1/bucket/year).
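
To show what consuming an inventory involves, the sketch below turns the rows
of one CSV inventory file into objects, taking the column names from the
``fileSchema`` field of the inventory's ``manifest.json``. It is not taken
from ``./resourceIndexer/main.js``; the function name ``parseInventoryRows``
is hypothetical, and the row parsing is deliberately naive: it assumes every
field is double-quoted and contains no embedded quotes.

```javascript
// Sketch only: a simplified reader for one (already decompressed) CSV
// inventory file. Not the indexer's actual implementation.

/**
 * @param {{fileSchema: string}} manifest  parsed manifest.json
 * @param {string} csvText                 contents of one inventory CSV
 * @returns {Object[]} one object per row, keyed by column name
 */
function parseInventoryRows(manifest, csvText) {
  const columns = manifest.fileSchema.split(",").map((name) => name.trim());
  return csvText
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .map((line) => {
      // Assumption: fields look like "a","b","c" with no embedded quotes.
      const values = line.slice(1, -1).split('","');
      const row = Object.fromEntries(columns.map((name, i) => [name, values[i]]));
      // Keys are URL-encoded in CSV inventories; check how your keys encode
      // spaces before relying on decodeURIComponent alone.
      row.Key = decodeURIComponent(row.Key);
      return row;
    });
}
```

With the last modified date and ETag columns enabled, each row carries enough
information to order the versions of a file and tell them apart.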

A lifecycle rule on the s3://nextstrain-inventories bucket (`console
link <https://s3.console.aws.amazon.com/s3/management/nextstrain-inventories/lifecycle/view?region=us-east-1&id=delete+stale+inventories>`__)
deletes all inventory-related files 30 days after they are created.

Index creation (Inventory access and index upload)
--------------------------------------------------

**Automated index generation**

*This section will be updated once the
index creation is automated.*

**Local index generation for development purposes**

For local index generation (e.g. during development) you will need IAM
credentials which can list and get objects from s3://nextstrain-inventories. If
you want finer-grained access for local index creation, you can restrict access to
certain prefixes in that bucket: ``nextstrain-data/config-v1`` and
``nextstrain-staging/config-v1`` correspond to the core and staging buckets,
respectively.
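
One way to express such a prefix restriction is an IAM policy document like
the one built below. This is a sketch, not the policy actually attached to any
Nextstrain user: the function name ``buildInventoryReadPolicy`` is
hypothetical, and the statements should be reviewed before use.

```javascript
// Sketch only: builds an IAM policy document granting list and get access
// to the given prefixes of the inventories bucket.
function buildInventoryReadPolicy(prefixes, bucket = "nextstrain-inventories") {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        // Listing is granted on the bucket, limited to the chosen prefixes.
        Effect: "Allow",
        Action: "s3:ListBucket",
        Resource: `arn:aws:s3:::${bucket}`,
        Condition: {
          StringLike: { "s3:prefix": prefixes.map((p) => `${p}/*`) },
        },
      },
      {
        // Reading is granted on the objects under those prefixes.
        Effect: "Allow",
        Action: "s3:GetObject",
        Resource: prefixes.map((p) => `arn:aws:s3:::${bucket}/${p}/*`),
      },
    ],
  };
}
```

For example, ``buildInventoryReadPolicy(["nextstrain-staging/config-v1"])``
yields a policy that only covers the staging inventory.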

To upload the index you will need write access to
s3://nextstrain-inventories/resources.json.gz. This is not necessary if you
only need the index for local development (see `Local development`_).


Index access by the server
--------------------------

IAM users ``nextstrain.org`` and ``nextstrain.org-testing``, which are under
terraform control, have read access to
s3://nextstrain-inventories/resources.json.gz via their associated policies.
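
The ``.json.gz`` suffix suggests the index is stored as gzip-compressed JSON.
Assuming that is the case, the sketch below shows the encoding and decoding
steps on either side of S3. The function names are hypothetical, and the S3
upload and download themselves are left out.

```javascript
// Sketch only: assumes resources.json.gz is gzip-compressed JSON.
const zlib = require("node:zlib");

/** Serialise an index object to the bytes that would be uploaded. */
function encodeIndex(index) {
  return zlib.gzipSync(JSON.stringify(index));
}

/** Parse the downloaded bytes back into an index object. */
function decodeIndex(buffer) {
  return JSON.parse(zlib.gunzipSync(buffer).toString("utf8"));
}
```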
