Viral reclassification #75 #81

rmcolq · 2024-11-27T11:59:02Z

This PR refactors the classification code so that a second classification round can be performed with a targeted viral database.

To trigger viral reclassification, add --run_viral_reclassification. This extracts only the viral+unclassified reads from the first classification round into a file, classifies that file with a viral database, then merges the kraken reports and assignment files (overwriting read assignments/counts where a new classification was made and keeping the originals where it wasn't) before continuing with the downstream pipeline steps (hcid/spike detection, read extraction, report generation).
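The overwrite-or-keep rule for the assignment files can be sketched as below. This is only an illustration of the rule as described, not the code in merge.py or krakenpy: the function names `parse_assignments` and `merge_assignments` are hypothetical, and the sketch assumes the standard tab-separated Kraken 2 output columns (C/U status, read ID, taxon ID, length, LCA mapping).

```python
from typing import Dict, Iterable, List


def parse_assignments(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Index Kraken 2 assignment lines by read ID (second column)."""
    assignments = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        assignments[fields[1]] = fields
    return assignments


def merge_assignments(
    first: Dict[str, List[str]], second: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Hypothetical merge: a second-round line replaces the first-round line
    only when the second round actually classified the read ("C")."""
    merged = dict(first)
    for read_id, fields in second.items():
        if fields[0] == "C":
            merged[read_id] = fields
    return merged


# Made-up example records: read1 was viral in round one and is left
# unclassified by the viral database; read2 is newly classified.
round_one = parse_assignments([
    "C\tread1\t10239\t150\t10239:116",
    "U\tread2\t0\t150\t0:116",
])
round_two = parse_assignments([
    "U\tread1\t0\t150\t0:116",
    "C\tread2\t11320\t150\t11320:116",
])
merged = merge_assignments(round_one, round_two)
```

In this sketch read1 keeps its first-round assignment and read2 takes the second-round one. The merged kraken report counts would need to be updated consistently with these assignments, which is not shown here.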

In the process:

Database loading/handling has been moved into its own module
Taxonomy loading/handling has been moved into its own module
The kraken2_client and kraken2_server modules have been cleaned of processes which are not directly setting up, running or taking down these servers/clients
kraken_classification now only does kraken classification
A new subworkflow for classification pulls together the different classification options
Classification-related channels now have input/output triples of unique_id,database_name,file for clarity (rather than pairs of unique_id,file)
The option to add additional bracken jsons/use the bracken json for extraction has been removed. It added confusion to the workflow and isn't wanted.

Note there are updates to report.py, assignments.py, merge.py and taxonomy.py. These files all exist in a separate project, https://github.com/rmcolq/krakenpy (along with unit tests), but since I haven't put it on bioconda yet they are imported directly for now. These classes simplify much of the kraken report handling, and I intend to continue refactoring other python scripts separately so that they import the same classes.

Some command line parameters have changed during this refactor. Namely:

the default database name was database_set and is now kraken_database.default.name
the host server was k2_host and is now kraken_database.default.host
the server port was k2_port and is now kraken_database.default.port
the stored database was database and is now kraken_database.default.path
The config file for climb has had the defaults which no longer apply removed. It can be updated on climb and saved on github if required.
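For anyone with existing launch scripts, the renames above can be captured as a small lookup table. The sketch below is not part of the pipeline: `translate_args` is a hypothetical helper, and the example values (port 8080, database name PlusPF) are made up for illustration.

```python
from typing import Dict, List

# Old parameter name -> new parameter name, as listed in this PR.
RENAMED_PARAMS: Dict[str, str] = {
    "database_set": "kraken_database.default.name",
    "k2_host": "kraken_database.default.host",
    "k2_port": "kraken_database.default.port",
    "database": "kraken_database.default.path",
}


def translate_args(args: List[str]) -> List[str]:
    """Rewrite old-style --flags to their new names, leaving values and
    unrelated flags untouched. Handles both "--name value" and "--name=value"."""
    translated = []
    for arg in args:
        if arg.startswith("--"):
            name, sep, value = arg[2:].partition("=")
            arg = "--" + RENAMED_PARAMS.get(name, name) + sep + value
        translated.append(arg)
    return translated


# Hypothetical old-style arguments.
old_args = ["--k2_port", "8080", "--database_set=PlusPF", "--run_viral_reclassification"]
new_args = translate_args(old_args)
```

Here `new_args` carries `--kraken_database.default.port` and `--kraken_database.default.name=PlusPF`, while `--run_viral_reclassification` passes through unchanged.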

Note that outputs have been compared on the tests with viral reclassification switched off and kraken running against the same database, and the outputs do not change in this case.

rmcolq added 5 commits November 26, 2024 16:27

refactor kraken running ahead of adding viral classification step - some parameter name changes but gives exactly the same output

a0ee907

add option to reclassify viral and unclassified fraction

7f2e230

bugfixes while testing

94e6a99

update after more testing

7e0a019

be explicit about the percentage threshold being of classified reads, not all reads

9386803

rmcolq mentioned this pull request Nov 29, 2024

Add virus kraken reclassification #75

Closed
