
Viral reclassification #75 #81

Open: wants to merge 5 commits into base: main_unstable

@rmcolq (Collaborator) commented Nov 27, 2024

This PR refactors the classification code so that a second classification round can be performed with a targeted viral database.

To trigger viral reclassification, add --run_viral_reclassification. This extracts only the viral+unclassified file from the first classification round, classifies that file against a viral database, then merges the kraken reports and assignment files. Where the second round made a new classification, the read assignments and counts are overwritten; where it did not, the first-round values are kept. The pipeline then continues with the downstream steps (hcid/spike detection, read extraction, report generation).
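The overwrite-or-keep merge rule described above can be sketched as follows. This is a minimal illustration only, not the code in merge.py or assignments.py: the function names are hypothetical, and it assumes the standard tab-separated Kraken2 read output (status `C`/`U`, read ID, taxon ID, then further columns). The read IDs and taxon IDs in the example are made up.

```python
from collections import Counter


def parse_assignments(lines):
    """Map read_id -> (status, taxid) from Kraken2-style per-read output lines."""
    assignments = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        assignments[read_id] = (status, taxid)
    return assignments


def merge_assignments(first_round, second_round):
    """Overwrite a first-round entry only where the second round classified the read."""
    merged = dict(first_round)
    for read_id, (status, taxid) in second_round.items():
        if status == "C":  # a new classification was made
            merged[read_id] = (status, taxid)
    return merged


def count_by_taxid(assignments):
    """Recompute direct read counts per taxon from the merged assignments."""
    return Counter(taxid for _, taxid in assignments.values())


# Hypothetical example: read2 and read3 were re-run against the viral database.
first = parse_assignments([
    "C\tread1\t9606\t150\t9606:116",
    "U\tread2\t0\t150\t0:116",
    "C\tread3\t10239\t150\t10239:116",
    "U\tread4\t0\t150\t0:116",
])
second = parse_assignments([
    "C\tread2\t11320\t150\t11320:116",
    "C\tread3\t11320\t150\t11320:116",
    "U\tread4\t0\t150\t0:116",
])
merged = merge_assignments(first, second)
counts = count_by_taxid(merged)
```

Here read1 keeps its first-round assignment because it was never reclassified, read4 stays unclassified because the second round also failed to classify it, and read2 and read3 take their second-round taxon.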

In the process:

  • Database loading/handling has been moved into its own module
  • Taxonomy loading/handling has been moved into its own module
  • kraken2_client and kraken2_server modules have been stripped of processes that are not directly involved in setting up, running or taking down these servers/clients
  • kraken_classification just does kraken classification
  • A new subworkflow for classification pulls together the different classification options
  • Classification-related channels now have input/output triples with unique_id,database_name,file for clarity (rather than pairs unique_id,file)
  • The option to add additional bracken jsons/use the bracken json for extraction has been removed. It added confusion to the workflow and isn't wanted.

Note there are updates to report.py, assignments.py, merge.py and taxonomy.py. These files all exist in a separate project https://github.com/rmcolq/krakenpy (along with unit tests), but since I haven't put it on bioconda they are imported directly at the moment. These classes simplify much of the kraken report handling, and I intend to continue refactoring other python scripts separately to import the same classes.

Some command line parameters have changed during this refactor. Namely:

  • the default database name was database_set and is now kraken_database.default.name
  • the host server was k2_host and is now kraken_database.default.host
  • the server port was k2_port and is now kraken_database.default.port
  • the stored database was database and is now kraken_database.default.path

The config file for climb has had defaults removed which no longer apply. It can be updated on climb and saved on github if required.
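For anyone with existing launch scripts, the parameter renames listed above could be applied mechanically. The helper below is a hypothetical sketch and is not part of this PR. It assumes the parameters are passed as `--name value` or `--name=value` tokens and that the new dotted names are accepted in the same position. The example values (`8080`, `/data/k2`) are placeholders.

```python
# Old parameter name -> new parameter name, as listed in this PR description.
OLD_TO_NEW = {
    "database_set": "kraken_database.default.name",
    "k2_host": "kraken_database.default.host",
    "k2_port": "kraken_database.default.port",
    "database": "kraken_database.default.path",
}


def migrate_args(args):
    """Rename old-style '--name' tokens, leaving values and unknown flags untouched."""
    migrated = []
    for token in args:
        if token.startswith("--"):
            # Split '--name=value' into name and value; plain '--name' has no value.
            name, sep, value = token[2:].partition("=")
            name = OLD_TO_NEW.get(name, name)
            token = f"--{name}{sep}{value}"
        migrated.append(token)
    return migrated


# Placeholder command-line tokens for illustration only.
old_args = ["--k2_port", "8080", "--database=/data/k2", "--run_viral_reclassification"]
new_args = migrate_args(old_args)
```

Flags that are not in the mapping, such as --run_viral_reclassification, pass through unchanged.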

Note that outputs have been compared on the tests with viral reclassification switched off and kraken running against the same database. In this case the refactor does not change the outputs.
