Naturalis #52

Merged
merged 38 commits into from
Nov 26, 2024
0cfeb12
adding unifi pacbio example file
Nov 6, 2024
964e53c
adding naturalis plates
Nov 6, 2024
b728adf
removing unifi file from naturalis stash
Nov 6, 2024
200aacb
Add files via upload
rvosa Nov 6, 2024
39e38d8
Add files via upload
rvosa Nov 6, 2024
890c6d0
Add files via upload
rvosa Nov 6, 2024
3a4fea8
Add files via upload
rvosa Nov 6, 2024
4acdef4
Add files via upload
rvosa Nov 6, 2024
a8c2250
Add files via upload
rvosa Nov 6, 2024
4638ab7
Merge pull request #47 from naturalis/main
rvosa Nov 6, 2024
040e550
adding BGE 147-151
Nov 7, 2024
6793eb5
analyzed
Nov 7, 2024
999c784
analyzed
Nov 7, 2024
3b64cc2
adding plates
Nov 11, 2024
09ce40a
adding validation results
Nov 11, 2024
52c3e67
adding recent runs
Nov 13, 2024
012d253
reran validator
Nov 13, 2024
f258d84
Merge pull request #48 from naturalis/main
rvosa Nov 13, 2024
0c409d8
reduced verbosity of COI-5P marker splicing relative to HMM
rvosa Nov 14, 2024
14298d7
reduced default verbosity
rvosa Nov 14, 2024
30e5bba
adding structural validator
rvosa Nov 14, 2024
9facf39
the sequence.id is now the default biopython behaviour, i.e. the firs…
rvosa Nov 25, 2024
b4be1a1
tests updated to reflect that the fasta parser doesn't return the pro…
rvosa Nov 25, 2024
5bd8c10
tests updated to reflect that the fasta parser doesn't return the pro…
rvosa Nov 25, 2024
e564286
updated to reflect that the fasta parser doesn't return the process I…
rvosa Nov 25, 2024
6c8d61a
result object now tracks the full sequence ID rather than the process…
rvosa Nov 25, 2024
8b43c12
Columns need to be initialized separately.
rvosa Nov 25, 2024
83c4d31
Updated to reflect the new signatures for SequenceHandler.parse_fasta…
rvosa Nov 25, 2024
a250d35
the name of the fasta file is not attached as ancillary but using the…
rvosa Nov 25, 2024
6efb361
res.identification_rank needs to be set from the scoped config. Theor…
rvosa Nov 25, 2024
f19d53c
needed other mock config object so that it clones
rvosa Nov 25, 2024
13f54ba
the MGE FASTA files seem to have developed underscores as some magic …
rvosa Nov 25, 2024
50587d6
config files for HPC benchmarking
rvosa Nov 26, 2024
b384874
reran validation
rvosa Nov 26, 2024
eb3e1a6
example structural validation input/output
rvosa Nov 26, 2024
fc240d3
Merge branch 'naturalis' of github.com:naturalis/barcode-validator in…
rvosa Nov 26, 2024
4cb3827
updated to reflect the new validator output
rvosa Nov 26, 2024
09184c1
Merge pull request #51 from naturalis/main
rvosa Nov 26, 2024
6 changes: 3 additions & 3 deletions barcode_validator/alignment.py
@@ -146,9 +146,9 @@ def translate_sequence(self, dna_sequence, table_idx):
self.logger.info("Translating DNA sequence to amino acids")

# Warn user
- self.logger.warning("NOTE: we assume that the 658bp COI-5P marker has an additional base at the start")
- self.logger.warning("NOTE: this first base needs to be removed to arrive at a multiple of 3 for AA translation")
- self.logger.warning("NOTE: here we remove this base so that the result is 657 bases")
+ self.logger.info("NOTE: we assume that the 658bp COI-5P marker has an additional base at the start")
+ self.logger.info("NOTE: this first base needs to be removed to arrive at a multiple of 3 for AA translation")
+ self.logger.info("NOTE: here we remove this base so that the result is 657 bases")

# Clone and phase sequence by starting from the second base (i.e. index 1 in 0-based indexing)
cloned_seq = deepcopy(dna_sequence)
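
Review note: the phasing step described in these log messages can be sketched in plain Python. This is a minimal illustration, not the code in `alignment.py`; the function name `phase_coi_5p` and the trailing-codon trim are assumptions added for the example.

```python
def phase_coi_5p(dna: str, offset: int = 1) -> str:
    """Drop the leading `offset` bases so that translation starts in frame."""
    phased = dna[offset:]
    # Assumption for this sketch: also trim a trailing partial codon, if any,
    # so the result is always a multiple of 3.
    remainder = len(phased) % 3
    if remainder:
        phased = phased[:-remainder]
    return phased


marker = "A" * 658              # placeholder with the nominal marker length
phased = phase_coi_5p(marker)
print(len(phased), len(phased) % 3)   # 657 0
```

Starting from index 1 turns the 658 bp marker into 657 bases, which is 219 complete codons.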
4 changes: 2 additions & 2 deletions barcode_validator/triage.py
@@ -72,7 +72,7 @@ def read_tsv_data(tsv_file):
logging.debug(f"Reading TSV file: {tsv_file}")
for row in reader:
try:
- data[row['process_id']] = row
+ data[row['sequence_id']] = row
except KeyError:
print(f"Error: Missing process_id in file {tsv_file}")
sys.exit(1)
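
Review note: the lookup key in this hunk is now `sequence_id`, but the error message in the `except` branch still names `process_id`. A minimal sketch of a reader whose message follows the key is given below. It departs from `triage.py` in ways that are assumptions of the example: it takes an open handle rather than a path, takes the key as a parameter, and raises instead of calling `sys.exit`.

```python
import csv
import io


def read_tsv_data(handle, key="sequence_id"):
    """Index TSV rows by `key`; raise if the column is missing."""
    reader = csv.DictReader(handle, delimiter="\t")
    data = {}
    for row in reader:
        try:
            data[row[key]] = row
        except KeyError:
            # Report the column that was actually looked up
            raise ValueError(f"Missing {key} column in TSV input") from None
    return data


# Illustrative input; the ID and family are made up for this example
tsv = "sequence_id\tfamily\nSEQ_1_r1\tApidae\n"
rows = read_tsv_data(io.StringIO(tsv))
print(rows["SEQ_1_r1"]["family"])   # Apidae
```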
@@ -189,7 +189,7 @@ def process_sequences(args):

logging.info(f"Starting sequence processing from {args.fasta}")
for record in SeqIO.parse(args.fasta, 'fasta'):
- seq_id = record.id.split('_')[0]
+ seq_id = record.id
logging.debug(f"Processing sequence: {seq_id}")

if seq_id not in metadata:
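
Review note: the effect of this change on the metadata lookup key can be shown without Biopython. The header below is hypothetical, and taking the first whitespace-delimited word only approximates what `Bio.SeqIO` exposes as `record.id` for FASTA input.

```python
def old_key(record_id: str) -> str:
    """Pre-PR behaviour: keep only the text before the first underscore."""
    return record_id.split("_")[0]


def new_key(record_id: str) -> str:
    """Post-PR behaviour: use the full record ID unchanged."""
    return record_id


header = ">ABC123-24_r_1.3_s_50 some description"   # hypothetical header
record_id = header[1:].split()[0]                    # stand-in for record.id

print(old_key(record_id))   # ABC123-24
print(new_key(record_id))   # ABC123-24_r_1.3_s_50
```

With the full ID as key, the TSV metadata has to carry the same full ID in its `sequence_id` column, otherwise the `seq_id not in metadata` branch is taken.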
59 changes: 59 additions & 0 deletions config/benchmarking/16_test_config.yml
@@ -0,0 +1,59 @@
# Details of the repository. This is where the daemon will find local files in pull requests and how it will compose
# the URLs to the various pull request service endpoints. For the endpoints we are quite intimately tied to GitHub
# and the GitHub API (e.g. in the structure of the JSON that is returned), so this is not very flexible.
repo_location: ./
repo_owner: naturalis
repo_name: barcode_validator
pr_db_file: ../examples/pr_status.db

# Alignment configuration. This needs the location of an HMM file for the focal marker. This is used in conjunction with
# `hmmalign` from the HMMER package to align the sequences to the HMM. The assumption is that the executable is in the
# PATH because it is installed in the conda environment as a dependency of the barcode_validator package.
hmm_file: ../examples/COI-5P.hmm

# Which taxonomic level to use for the ID check. The idea is that the expected taxon (in BOLD) and the observed taxa
# (in NCBI) are at this level in the taxonomy and match exactly. So far this has worked well for the family level, and
# this should hold for higher taxa as well.
level: family

# Where to constrain the BLAST search. This indicates the higher taxon level within the NCBI taxonomy to which the
# observed taxa and the expected taxon belong and to which the search should be constrained. This is used to speed up
# the search and to avoid false positives.
constrain: class

# Configuration for local blast. The lower case values are used as arguments to the blastn command. The upper case
# values are used to set environment variables that are used by the blastn command. The BLASTDB environment variable
# is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE environment variable is used to
# specify the amount of RAM allocated to the blastn job.
blast_db: /home/rutger.vos/data/ncbi/nt/nt

# MaaS 39 has 64 cores. This benchmarking configuration runs BLAST with 16 threads.
num_threads: 16
evalue: 1e-5
max_target_seqs: 10
word_size: 28

# The BLASTDB environment variable is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE
# environment variable is used to specify the amount of RAM allocated to the blastn job. For some reason, setting the
# latter to any value causes the blastn job to fail with a segmentation fault. So let's not do that.
BLASTDB_LMDB_MAP_SIZE: 128G
BLASTDB: /home/rutger.vos/data/ncbi/nt

# Location of the NCBI taxonomy dump. This must be the tar.gz file that contains the nodes.dmp and names.dmp files.
# This corresponds with the dump made available at http://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
ncbi_taxonomy: /home/rutger.vos/data/ncbi/taxdump/taxdump.tar.gz

# Location of the BOLD Excel file. This file is used to match the process IDs (first word in the FASTA file headers)
# to the species names and higher taxon lineages. Instructions on how this file is generated can be found in the
# README.md file in this folder.
bold_sheet_file: ../examples/bold.xlsx

# Configuration for logging. The verbosity level specified here is overridden by the value provided on the command
# line with the -v/-verbosity argument. The log file is written to the current working directory.
log_level: WARNING
log_file: ../log_file.log

# Which translation table is used. This value is an integer that corresponds to the tables used by NCBI and parsed
# by biopython. The default is 5, which is the table for invertebrate mitochondrial DNA.
# See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG5
translation_table: 5
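
Review note: the config comments say that lower-case values become `blastn` arguments and upper-case values become environment variables. A minimal sketch of that split is below. The `FLAGS` mapping and the `build_blastn` helper are hypothetical and may not match how barcode_validator wires this up; the flags themselves (`-db`, `-query`, `-num_threads`, `-evalue`, `-max_target_seqs`, `-word_size`) are standard `blastn` options. Paths are placeholders.

```python
import os

config = {
    "blast_db": "/path/to/nt/nt",        # placeholder path
    "num_threads": 16,
    "evalue": 1e-5,
    "max_target_seqs": 10,
    "word_size": 28,
    "BLASTDB": "/path/to/nt",            # placeholder path
    "BLASTDB_LMDB_MAP_SIZE": "128G",
}

# Hypothetical mapping from config key to blastn flag
FLAGS = {
    "blast_db": "-db",
    "num_threads": "-num_threads",
    "evalue": "-evalue",
    "max_target_seqs": "-max_target_seqs",
    "word_size": "-word_size",
}


def build_blastn(config, query):
    """Return (argv, env) for a blastn call described by `config`."""
    cmd = ["blastn", "-query", query]
    for key, flag in FLAGS.items():
        if key in config:
            cmd += [flag, str(config[key])]
    # Upper-case keys are exported as environment variables
    env = dict(os.environ)
    env.update({k: str(v) for k, v in config.items() if k.isupper()})
    return cmd, env


cmd, env = build_blastn(config, "query.fasta")
print(" ".join(cmd))
```

The pair can then be handed to `subprocess.run(cmd, env=env)` on a machine where BLAST+ is installed.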
59 changes: 59 additions & 0 deletions config/benchmarking/1_test_config.yml
@@ -0,0 +1,59 @@
# Details of the repository. This is where the daemon will find local files in pull requests and how it will compose
# the URLs to the various pull request service endpoints. For the endpoints we are quite intimately tied to GitHub
# and the GitHub API (e.g. in the structure of the JSON that is returned), so this is not very flexible.
repo_location: ./
repo_owner: naturalis
repo_name: barcode_validator
pr_db_file: ../examples/pr_status.db

# Alignment configuration. This needs the location of an HMM file for the focal marker. This is used in conjunction with
# `hmmalign` from the HMMER package to align the sequences to the HMM. The assumption is that the executable is in the
# PATH because it is installed in the conda environment as a dependency of the barcode_validator package.
hmm_file: ../examples/COI-5P.hmm

# Which taxonomic level to use for the ID check. The idea is that the expected taxon (in BOLD) and the observed taxa
# (in NCBI) are at this level in the taxonomy and match exactly. So far this has worked well for the family level, and
# this should hold for higher taxa as well.
level: family

# Where to constrain the BLAST search. This indicates the higher taxon level within the NCBI taxonomy to which the
# observed taxa and the expected taxon belong and to which the search should be constrained. This is used to speed up
# the search and to avoid false positives.
constrain: class

# Configuration for local blast. The lower case values are used as arguments to the blastn command. The upper case
# values are used to set environment variables that are used by the blastn command. The BLASTDB environment variable
# is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE environment variable is used to
# specify the amount of RAM allocated to the blastn job.
blast_db: /home/rutger.vos/data/ncbi/nt/nt

# MaaS 39 has 64 cores. This benchmarking configuration runs BLAST with 1 thread.
num_threads: 1
evalue: 1e-5
max_target_seqs: 10
word_size: 28

# The BLASTDB environment variable is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE
# environment variable is used to specify the amount of RAM allocated to the blastn job. For some reason, setting the
# latter to any value causes the blastn job to fail with a segmentation fault. So let's not do that.
BLASTDB_LMDB_MAP_SIZE: 128G
BLASTDB: /home/rutger.vos/data/ncbi/nt

# Location of the NCBI taxonomy dump. This must be the tar.gz file that contains the nodes.dmp and names.dmp files.
# This corresponds with the dump made available at http://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
ncbi_taxonomy: /home/rutger.vos/data/ncbi/taxdump/taxdump.tar.gz

# Location of the BOLD Excel file. This file is used to match the process IDs (first word in the FASTA file headers)
# to the species names and higher taxon lineages. Instructions on how this file is generated can be found in the
# README.md file in this folder.
bold_sheet_file: ../examples/bold.xlsx

# Configuration for logging. The verbosity level specified here is overridden by the value provided on the command
# line with the -v/-verbosity argument. The log file is written to the current working directory.
log_level: WARNING
log_file: ../log_file.log

# Which translation table is used. This value is an integer that corresponds to the tables used by NCBI and parsed
# by biopython. The default is 5, which is the table for invertebrate mitochondrial DNA.
# See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG5
translation_table: 5
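
Review note: the `level` setting describes an exact match between the expected taxon (from BOLD) and the taxa observed among the BLAST hits (from NCBI). One way to read that is sketched below. The function name, the dictionary layout of a lineage, and the "expected taxon is among the observed taxa" rule are assumptions for illustration, not the validator's actual logic.

```python
def taxon_matches(expected_lineage, hit_lineages, level="family"):
    """Return (ok, observed).

    ok is True when the expected taxon at `level` exactly equals
    one of the taxa observed among the hits at that level.
    """
    expected = expected_lineage.get(level)
    observed = {lineage[level] for lineage in hit_lineages if level in lineage}
    return (expected is not None and expected in observed), observed


# Illustrative lineages, not taken from the PR
expected = {"class": "Insecta", "family": "Apidae"}
hits = [
    {"class": "Insecta", "family": "Apidae"},
    {"class": "Insecta", "family": "Megachilidae"},
]
ok, observed = taxon_matches(expected, hits)
print(ok, sorted(observed))   # True ['Apidae', 'Megachilidae']
```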
59 changes: 59 additions & 0 deletions config/benchmarking/2_test_config.yml
@@ -0,0 +1,59 @@
# Details of the repository. This is where the daemon will find local files in pull requests and how it will compose
# the URLs to the various pull request service endpoints. For the endpoints we are quite intimately tied to GitHub
# and the GitHub API (e.g. in the structure of the JSON that is returned), so this is not very flexible.
repo_location: ./
repo_owner: naturalis
repo_name: barcode_validator
pr_db_file: ../examples/pr_status.db

# Alignment configuration. This needs the location of an HMM file for the focal marker. This is used in conjunction with
# `hmmalign` from the HMMER package to align the sequences to the HMM. The assumption is that the executable is in the
# PATH because it is installed in the conda environment as a dependency of the barcode_validator package.
hmm_file: ../examples/COI-5P.hmm

# Which taxonomic level to use for the ID check. The idea is that the expected taxon (in BOLD) and the observed taxa
# (in NCBI) are at this level in the taxonomy and match exactly. So far this has worked well for the family level, and
# this should hold for higher taxa as well.
level: family

# Where to constrain the BLAST search. This indicates the higher taxon level within the NCBI taxonomy to which the
# observed taxa and the expected taxon belong and to which the search should be constrained. This is used to speed up
# the search and to avoid false positives.
constrain: class

# Configuration for local blast. The lower case values are used as arguments to the blastn command. The upper case
# values are used to set environment variables that are used by the blastn command. The BLASTDB environment variable
# is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE environment variable is used to
# specify the amount of RAM allocated to the blastn job.
blast_db: /home/rutger.vos/data/ncbi/nt/nt

# MaaS 39 has 64 cores. This benchmarking configuration runs BLAST with 2 threads.
num_threads: 2
evalue: 1e-5
max_target_seqs: 10
word_size: 28

# The BLASTDB environment variable is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE
# environment variable is used to specify the amount of RAM allocated to the blastn job. For some reason, setting the
# latter to any value causes the blastn job to fail with a segmentation fault. So let's not do that.
BLASTDB_LMDB_MAP_SIZE: 128G
BLASTDB: /home/rutger.vos/data/ncbi/nt

# Location of the NCBI taxonomy dump. This must be the tar.gz file that contains the nodes.dmp and names.dmp files.
# This corresponds with the dump made available at http://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
ncbi_taxonomy: /home/rutger.vos/data/ncbi/taxdump/taxdump.tar.gz

# Location of the BOLD Excel file. This file is used to match the process IDs (first word in the FASTA file headers)
# to the species names and higher taxon lineages. Instructions on how this file is generated can be found in the
# README.md file in this folder.
bold_sheet_file: ../examples/bold.xlsx

# Configuration for logging. The verbosity level specified here is overridden by the value provided on the command
# line with the -v/-verbosity argument. The log file is written to the current working directory.
log_level: WARNING
log_file: ../log_file.log

# Which translation table is used. This value is an integer that corresponds to the tables used by NCBI and parsed
# by biopython. The default is 5, which is the table for invertebrate mitochondrial DNA.
# See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG5
translation_table: 5
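
Review note: one practical consequence of `translation_table: 5` is that only `TAA` and `TAG` terminate translation; `TGA`, a stop in the standard code, encodes tryptophan in the invertebrate mitochondrial code. The sketch below scans an in-frame sequence for stops under that table. The helper name is hypothetical, and the validator itself relies on Biopython for translation rather than a hand-written scan.

```python
# Stop codons in NCBI translation table 5 (invertebrate mitochondrial)
TABLE5_STOPS = {"TAA", "TAG"}


def stop_codon_positions(phased_dna: str):
    """Return 0-based nucleotide offsets of stop codons in an in-frame sequence."""
    seq = phased_dna.upper()
    return [
        i
        for i in range(0, len(seq) - 2, 3)
        if seq[i:i + 3] in TABLE5_STOPS
    ]


# ATG TGA TAA: TGA is not a stop under table 5, TAA is
print(stop_codon_positions("ATGTGATAA"))   # [6]
```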
59 changes: 59 additions & 0 deletions config/benchmarking/32_test_config.yml
@@ -0,0 +1,59 @@
# Details of the repository. This is where the daemon will find local files in pull requests and how it will compose
# the URLs to the various pull request service endpoints. For the endpoints we are quite intimately tied to GitHub
# and the GitHub API (e.g. in the structure of the JSON that is returned), so this is not very flexible.
repo_location: ./
repo_owner: naturalis
repo_name: barcode_validator
pr_db_file: ../examples/pr_status.db

# Alignment configuration. This needs the location of an HMM file for the focal marker. This is used in conjunction with
# `hmmalign` from the HMMER package to align the sequences to the HMM. The assumption is that the executable is in the
# PATH because it is installed in the conda environment as a dependency of the barcode_validator package.
hmm_file: ../examples/COI-5P.hmm

# Which taxonomic level to use for the ID check. The idea is that the expected taxon (in BOLD) and the observed taxa
# (in NCBI) are at this level in the taxonomy and match exactly. So far this has worked well for the family level, and
# this should hold for higher taxa as well.
level: family

# Where to constrain the BLAST search. This indicates the higher taxon level within the NCBI taxonomy to which the
# observed taxa and the expected taxon belong and to which the search should be constrained. This is used to speed up
# the search and to avoid false positives.
constrain: class

# Configuration for local blast. The lower case values are used as arguments to the blastn command. The upper case
# values are used to set environment variables that are used by the blastn command. The BLASTDB environment variable
# is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE environment variable is used to
# specify the amount of RAM allocated to the blastn job.
blast_db: /home/rutger.vos/data/ncbi/nt/nt

# MaaS 39 has 64 cores. This benchmarking configuration runs BLAST with 32 threads.
num_threads: 32
evalue: 1e-5
max_target_seqs: 10
word_size: 28

# The BLASTDB environment variable is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE
# environment variable is used to specify the amount of RAM allocated to the blastn job. For some reason, setting the
# latter to any value causes the blastn job to fail with a segmentation fault. So let's not do that.
BLASTDB_LMDB_MAP_SIZE: 128G
BLASTDB: /home/rutger.vos/data/ncbi/nt

# Location of the NCBI taxonomy dump. This must be the tar.gz file that contains the nodes.dmp and names.dmp files.
# This corresponds with the dump made available at http://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
ncbi_taxonomy: /home/rutger.vos/data/ncbi/taxdump/taxdump.tar.gz

# Location of the BOLD Excel file. This file is used to match the process IDs (first word in the FASTA file headers)
# to the species names and higher taxon lineages. Instructions on how this file is generated can be found in the
# README.md file in this folder.
bold_sheet_file: ../examples/bold.xlsx

# Configuration for logging. The verbosity level specified here is overridden by the value provided on the command
# line with the -v/-verbosity argument. The log file is written to the current working directory.
log_level: WARNING
log_file: ../log_file.log

# Which translation table is used. This value is an integer that corresponds to the tables used by NCBI and parsed
# by biopython. The default is 5, which is the table for invertebrate mitochondrial DNA.
# See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG5
translation_table: 5
59 changes: 59 additions & 0 deletions config/benchmarking/4_test_config.yml
@@ -0,0 +1,59 @@
# Details of the repository. This is where the daemon will find local files in pull requests and how it will compose
# the URLs to the various pull request service endpoints. For the endpoints we are quite intimately tied to GitHub
# and the GitHub API (e.g. in the structure of the JSON that is returned), so this is not very flexible.
repo_location: ./
repo_owner: naturalis
repo_name: barcode_validator
pr_db_file: ../examples/pr_status.db

# Alignment configuration. This needs the location of an HMM file for the focal marker. This is used in conjunction with
# `hmmalign` from the HMMER package to align the sequences to the HMM. The assumption is that the executable is in the
# PATH because it is installed in the conda environment as a dependency of the barcode_validator package.
hmm_file: ../examples/COI-5P.hmm

# Which taxonomic level to use for the ID check. The idea is that the expected taxon (in BOLD) and the observed taxa
# (in NCBI) are at this level in the taxonomy and match exactly. So far this has worked well for the family level, and
# this should hold for higher taxa as well.
level: family

# Where to constrain the BLAST search. This indicates the higher taxon level within the NCBI taxonomy to which the
# observed taxa and the expected taxon belong and to which the search should be constrained. This is used to speed up
# the search and to avoid false positives.
constrain: class

# Configuration for local blast. The lower case values are used as arguments to the blastn command. The upper case
# values are used to set environment variables that are used by the blastn command. The BLASTDB environment variable
# is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE environment variable is used to
# specify the amount of RAM allocated to the blastn job.
blast_db: /home/rutger.vos/data/ncbi/nt/nt

# MaaS 39 has 64 cores. This benchmarking configuration runs BLAST with 4 threads.
num_threads: 4
evalue: 1e-5
max_target_seqs: 10
word_size: 28

# The BLASTDB environment variable is used to specify the location of the BLAST database. The BLASTDB_LMDB_MAP_SIZE
# environment variable is used to specify the amount of RAM allocated to the blastn job. For some reason, setting the
# latter to any value causes the blastn job to fail with a segmentation fault. So let's not do that.
BLASTDB_LMDB_MAP_SIZE: 128G
BLASTDB: /home/rutger.vos/data/ncbi/nt

# Location of the NCBI taxonomy dump. This must be the tar.gz file that contains the nodes.dmp and names.dmp files.
# This corresponds with the dump made available at http://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
ncbi_taxonomy: /home/rutger.vos/data/ncbi/taxdump/taxdump.tar.gz

# Location of the BOLD Excel file. This file is used to match the process IDs (first word in the FASTA file headers)
# to the species names and higher taxon lineages. Instructions on how this file is generated can be found in the
# README.md file in this folder.
bold_sheet_file: ../examples/bold.xlsx

# Configuration for logging. The verbosity level specified here is overridden by the value provided on the command
# line with the -v/-verbosity argument. The log file is written to the current working directory.
log_level: WARNING
log_file: ../log_file.log

# Which translation table is used. This value is an integer that corresponds to the tables used by NCBI and parsed
# by biopython. The default is 5, which is the table for invertebrate mitochondrial DNA.
# See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG5
translation_table: 5
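
Review note: the benchmarking configs shown in this diff differ only in `num_threads`. If that is intended, the variants could be derived from one template instead of being maintained by hand. The sketch below is one possible approach, not something this PR does; the helper name and the shortened template are assumptions, and the thread counts are the ones visible in this diff.

```python
import re


def with_threads(config_text: str, threads: int) -> str:
    """Return `config_text` with its single `num_threads:` line set to `threads`."""
    new_text, count = re.subn(
        r"(?m)^num_threads:[ \t]*\d+[ \t]*$",
        f"num_threads: {threads}",
        config_text,
    )
    if count != 1:
        raise ValueError(f"expected exactly one num_threads line, found {count}")
    return new_text


# Shortened stand-in for the full YAML file
template = "evalue: 1e-5\nnum_threads: 16\nword_size: 28\n"

for n in (1, 2, 4, 16, 32):
    variant = with_threads(template, n)
    print(f"{n}_test_config.yml:", variant.splitlines()[1])
```

Working on the text rather than a parsed YAML document keeps the explanatory comments in each generated file.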