Migrate to augur curate #9

Merged 2 commits on Jul 17, 2024
2 changes: 2 additions & 0 deletions ingest/defaults/config.yaml
@@ -66,6 +66,8 @@ curate:
   strain_backup_fields: ["accession"]
   # List of date fields to standardize to ISO format YYYY-MM-DD
   date_fields: ["date", "date_released", "date_updated"]
+  # The expected field that contains the GenBank geo_loc_name
+  genbank_location_field: location
   # List of expected date formats that are present in the date fields provided above
   # These date formats should use directives expected by datetime
   # See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
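For reviewers unfamiliar with the `date_fields` / `expected_date_formats` pair in this config, the idea of trying each `datetime` directive string in order can be sketched as below. This is a minimal illustration only, not augur's implementation: the function name `to_iso_date` and the two example formats are assumptions, and the sketch ignores incomplete dates.

```python
from datetime import datetime


def to_iso_date(value, expected_formats):
    """Return value as YYYY-MM-DD using the first format that parses, else None."""
    for fmt in expected_formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None


# Hypothetical formats for illustration; the real list lives in config.yaml.
formats = ["%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ"]
print(to_iso_date("2024-07-17T12:30:00Z", formats))
```

Returning `None` for unparseable values keeps the sketch short; a real curation step would need to decide whether to fail, warn, or pass the value through.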
15 changes: 8 additions & 7 deletions ingest/rules/curate.smk
@@ -72,6 +72,7 @@ rule curate:
         strain_regex=config["curate"]["strain_regex"],
         strain_backup_fields=config["curate"]["strain_backup_fields"],
         date_fields=config["curate"]["date_fields"],
+        genbank_location_field=config["curate"]["genbank_location_field"],
         expected_date_formats=config["curate"]["expected_date_formats"],
         articles=config["curate"]["titlecase"]["articles"],
         abbreviations=config["curate"]["titlecase"]["abbreviations"],
@@ -85,30 +86,30 @@ rule curate:
     shell:
         """
         (cat {input.sequences_ndjson} \
-            | ./vendored/transform-field-names \
+            | augur curate rename \
                 --field-map {params.field_map} \
             | augur curate normalize-strings \
-            | ./vendored/transform-strain-names \
+            | augur curate transform-strain-name \
                 --strain-regex {params.strain_regex} \
                 --backup-fields {params.strain_backup_fields} \
             | augur curate format-dates \
                 --date-fields {params.date_fields} \
                 --expected-date-formats {params.expected_date_formats} \
-            | ./vendored/transform-genbank-location \
+            | augur curate parse-genbank-location \
+                --location-field {params.genbank_location_field} \
             | augur curate titlecase \
                 --titlecase-fields {params.titlecase_fields} \
                 --articles {params.articles} \
                 --abbreviations {params.abbreviations} \
-            | ./vendored/transform-authors \
+            | augur curate abbreviate-authors \
                 --authors-field {params.authors_field} \
                 --default-value {params.authors_default_value} \
                 --abbr-authors-field {params.abbr_authors_field} \
-            | ./vendored/apply-geolocation-rules \
+            | augur curate apply-geolocation-rules \
                 --geolocation-rules {input.all_geolocation_rules} \
-            | ./vendored/merge-user-metadata \
+            | augur curate apply-record-annotations \
                 --annotations {input.annotations} \
                 --id-field {params.annotations_id} \
             | augur curate passthru \
                 --output-metadata {output.metadata} \
                 --output-fasta {output.sequences} \
                 --output-id-field {params.id_field} \
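The shell pipeline in this rule passes NDJSON records from one stage to the next over stdin/stdout. The same shape can be sketched in Python as chained generators. This is an illustration under stated assumptions, not how augur implements these subcommands: the function names, the sample record, the field map, the `^.+$` regex, and the fallback behavior (use the first non-empty backup field when `strain` does not match) are all hypothetical.

```python
import io
import json
import re


def read_ndjson(stream):
    """Yield one dict per non-empty NDJSON line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


def rename_fields(records, field_map):
    """Rename keys per field_map, leaving unmapped keys untouched."""
    for record in records:
        yield {field_map.get(key, key): value for key, value in record.items()}


def fill_strain(records, strain_regex, backup_fields):
    """Assumed behavior: if 'strain' fails the regex, use the first non-empty backup field."""
    pattern = re.compile(strain_regex)
    for record in records:
        if not pattern.match(record.get("strain") or ""):
            record["strain"] = next(
                (str(record[field]) for field in backup_fields if record.get(field)),
                "",
            )
        yield record


# Hypothetical input record and parameters, for illustration only.
raw = io.StringIO('{"Accession": "AB123456", "Isolate": ""}\n')
records = read_ndjson(raw)
records = rename_fields(records, {"Accession": "accession", "Isolate": "strain"})
records = fill_strain(records, r"^.+$", ["accession"])
for record in records:
    print(json.dumps(record))
```

Because each stage is a generator, records stream through one at a time, which mirrors why the stages in the rule can be reordered or swapped individually.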
3 changes: 0 additions & 3 deletions ingest/vendored/.cramrc

This file was deleted.

8 changes: 0 additions & 8 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -13,11 +13,3 @@ jobs:
     steps:
       - uses: actions/checkout@v4
       - uses: nextstrain/.github/actions/shellcheck@master
-
-  cram:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
-      - run: pip install cram
-      - run: cram tests/
4 changes: 2 additions & 2 deletions ingest/vendored/.gitrepo
@@ -6,7 +6,7 @@
 [subrepo]
 	remote = https://github.com/nextstrain/ingest
 	branch = main
-	commit = c94d78d1f38b99e893007a76526f3d3824ecded0
-	parent = 99034d1912479521e2c657c1a898aee6f803bb67
+	commit = 258ab8ce898a88089bc88caee336f8d683a0e79a
+	parent = c06187ad53db4d9d6beb1afbd15ca0078c5b539c
 	method = merge
 	cmdver = 0.4.6
14 changes: 0 additions & 14 deletions ingest/vendored/README.md
@@ -117,15 +117,6 @@ Potential Nextstrain CLI scripts
 - [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
   Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.
 
-Potential augur curate scripts
-
-- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records
-- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
-- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
-- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
-- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
-- [transform-strain-names](transform-strain-names) - Ordered search for strain names across several fields.
-
 ## Software requirements
 
 Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash) which is more up to date.
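As context for the removed `transform-genbank-location` entry, the pattern `"<country_value>[:<region>][, <locality>]"` can be split roughly as follows. This is a sketch of the pattern only, not the vendored script or the `augur curate parse-genbank-location` implementation; the function name and the output keys `country`, `division`, and `location` are assumptions for illustration.

```python
def parse_genbank_location(location):
    """Split '<country>[:<region>][, <locality>]' into three stripped parts."""
    # Everything before the first colon is the country.
    country, _, rest = location.partition(":")
    # Within the remainder, the first comma separates region from locality.
    region, _, locality = rest.partition(",")
    return {
        "country": country.strip(),
        "division": region.strip(),   # assumed output key
        "location": locality.strip(),  # assumed output key
    }


print(parse_genbank_location("USA: California, Los Angeles"))
print(parse_genbank_location("Kenya"))
```

Missing optional parts come back as empty strings, so downstream steps can treat all three keys as always present.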
@@ -134,11 +125,6 @@ Some scripts may require Bash ≥4. If you are running these scripts on macOS, t
 
 Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.
 
-For more locally testable scripts, Cram-style functional tests live in `tests` and are run as part of CI. To run these locally,
-
-1. Download Cram: `pip install cram`
-2. Run the tests: `cram tests/`
-
 ## Working on this repo
 
 This repo is configured to use [pre-commit](https://pre-commit.com),
234 changes: 0 additions & 234 deletions ingest/vendored/apply-geolocation-rules

This file was deleted.

55 changes: 0 additions & 55 deletions ingest/vendored/merge-user-metadata

This file was deleted.
