Commit

Merge pull request #1350 from jfy133/createtaxdb
createtaxdb: add broken samplesheets
jfy133 authored Oct 17, 2024
2 parents 57633c9 + a4cdbea commit 80c00f5
Showing 6 changed files with 34 additions and 5 deletions.
22 changes: 17 additions & 5 deletions README.md
@@ -1,4 +1,5 @@
# ![nfcore/test-datasets](docs/images/test-datasets_logo.png)

Test data to be used for automated testing with the nf-core pipelines

> ⚠️ **Do not merge your test data to `master`! Each pipeline has a dedicated branch (and a special one for modules)**
@@ -13,8 +14,8 @@ The principle for nf-core test data is as small as possible, as large as necessa

nf-core/test-datasets comes with documentation in the `docs/` directory:

01. [Add a new test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/ADD_NEW_DATA.md)
02. [Use an existing test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/USE_EXISTING_DATA.md)
1. [Add a new test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/ADD_NEW_DATA.md)
2. [Use an existing test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/USE_EXISTING_DATA.md)

## Downloading test data

@@ -41,15 +42,26 @@ For further information or help, don't hesitate to get in touch on our [Slack or

### FASTA files

FASTA reference files used for building databases are copies of the nf-core/modules test dataset files (`sarscov2` and `haemophilus_influenzae` files) as of December 2023.

- [sarscov2.fasta](https://github.com/nf-core/test-datasets/blob/0d5006780e17a3b11a36437d220c372c2e6e4ed0/data/genomics/sarscov2/genome/genome.fasta)
- [sarscov2.faa](https://github.com/nf-core/test-datasets/blob/89f6476aa0006451c1e9ea789ce4e4173c892319/data/genomics/sarscov2/genome/proteome.fasta)
- [haemophilus_influenzae.fna.gz](https://github.com/nf-core/test-datasets/blob/575e27aa850e186d4bcf85afc5572648aa35f2f4/data/genomics/prokaryotes/haemophilus_influenzae/genome/genome.fna.gz)


### taxonomy files

These are NCBI taxdump re-constructed files, where the entries only include those of the two FASTA files above (rather than the entire tax dump).

- Prot taxdump: as of December 2023

## Broken Samplesheets

To help improve schema checking, we've taken the main `test.csv` and added a few variants that contain various errors.

Each file _should_ fail and give an error message from nf-schema.

- `samplesheets/broken/test_duplicate_id_and_path.csv`: has cells in both `id` and `fasta_aa` duplicated when they should be unique
- `samplesheets/broken/test_duplicate_id_only.csv`: has duplicated cells in `id` only, when every `id` should be unique
- `samplesheets/broken/test_missing_both_paths.csv`: has a row where both required `fasta_dna` and `fasta_aa` paths are missing
- `samplesheets/broken/test_missing_required_column.csv`: missing the required `taxid` column
- `samplesheets/broken/test_non_existent_file.csv`: has a `fasta_dna` path pointing to a file that doesn't exist
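
The error classes listed above can be illustrated with a small, plain-Python sketch. This is not how nf-schema performs validation; it only mirrors the rules described in the list. The function name `find_samplesheet_errors`, the column groupings, and the optional `path_exists` callback are hypothetical and chosen for illustration.

```python
import csv
import io
from collections import Counter

# Assumptions drawn from the list above, not from the pipeline's real schema.
REQUIRED_COLUMNS = ("id", "taxid")        # columns that must be present
PATH_COLUMNS = ("fasta_dna", "fasta_aa")  # at least one must be filled per row
UNIQUE_COLUMNS = ("id", "fasta_aa")       # non-empty values must not repeat


def find_samplesheet_errors(csv_text, path_exists=None):
    """Return a list of human-readable problems found in a samplesheet.

    `path_exists` is an optional callable taking a path or URL string and
    returning True/False; when omitted, file existence is not checked.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    errors = [
        f"missing required column: {col}"
        for col in REQUIRED_COLUMNS
        if col not in header
    ]
    rows = list(reader)

    # Duplicate check: a non-empty value may appear only once per unique column.
    for col in UNIQUE_COLUMNS:
        counts = Counter(row[col] for row in rows if row.get(col))
        errors += [f"duplicate {col}: {value}" for value, n in counts.items() if n > 1]

    # Per-row checks; line 1 of the file is the header, so data starts at 2.
    for line_no, row in enumerate(rows, start=2):
        paths = [row.get(col) or "" for col in PATH_COLUMNS]
        if not any(paths):
            errors.append(f"line {line_no}: both fasta_dna and fasta_aa are empty")
        if path_exists is not None:
            errors += [
                f"line {line_no}: file not found: {p}"
                for p in paths
                if p and not path_exists(p)
            ]
    return errors
```

Calling `find_samplesheet_errors` on the text of a valid samplesheet should return an empty list, while each broken variant should yield at least one message. Passing a `path_exists` callback (for example, one that issues an HTTP `HEAD` request) enables the missing-file check without hard-wiring network access into the function.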
4 changes: 4 additions & 0 deletions samplesheets/broken/test_duplicate_id_and_path.csv
@@ -0,0 +1,4 @@
id,taxid,fasta_dna,fasta_aa
Severe_acute_respiratory_syndrome_coronavirus_2,2697049,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Severe_acute_respiratory_syndrome_coronavirus_2,2697049,https://github.com/nf-core/test-datasets/blob/modules/data/genomics/prokaryotes/bacteroides_fragilis/genome/genome.fna.gz,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Haemophilus_influenzae,727,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/haemophilus_influenzae.fna.gz,
4 changes: 4 additions & 0 deletions samplesheets/broken/test_duplicate_id_only.csv
@@ -0,0 +1,4 @@
id,taxid,fasta_dna,fasta_aa
Severe_acute_respiratory_syndrome_coronavirus_2,2697049,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Severe_acute_respiratory_syndrome_coronavirus_22,2697049,https://github.com/nf-core/test-datasets/blob/modules/data/genomics/prokaryotes/bacteroides_fragilis/genome/genome.fna.gz,
Haemophilus_influenzae,727,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/haemophilus_influenzae.fna.gz,
3 changes: 3 additions & 0 deletions samplesheets/broken/test_missing_both_paths.csv
@@ -0,0 +1,3 @@
id,taxid,fasta_dna,fasta_aa
Severe_acute_respiratory_syndrome_coronavirus_2,2697049,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Haemophilus_influenzae,727,,
3 changes: 3 additions & 0 deletions samplesheets/broken/test_missing_required_column.csv
@@ -0,0 +1,3 @@
id,fasta_dna,fasta_aa
Severe_acute_respiratory_syndrome_coronavirus_2,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Haemophilus_influenzae,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/haemophilus_influenzae.fna.gz,
3 changes: 3 additions & 0 deletions samplesheets/broken/test_non_existent_file.csv
@@ -0,0 +1,3 @@
id,taxid,fasta_dna,fasta_aa
Severe_acute_respiratory_syndrome_coronavirus_2,2697049,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Haemophilus_influenzae,727,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/haemophilus_influenzaexxxxxx.fna.gz,
