Replace BMGE with ClipKit #67

Open
2 of 4 tasks
cmorganl opened this issue Jan 8, 2021 · 1 comment · May be fixed by #93
cmorganl commented Jan 8, 2021

ClipKIT is a new Python package for trimming multiple sequence alignments (MSAs). Its authors report that MSAs trimmed by ClipKIT are more "desirable" (a measure combining RF distance and bipartition support) than those from competing tools, including BMGE.

Using ClipKIT instead of BMGE would also simplify installation: the BMGE.jar file would no longer need to be packaged with TreeSAPP, and ClipKIT could instead be installed with pip or conda.

  • Write a ClipKIT helper class to facilitate trimming a FASTA file
  • Determine the optimal parameters by comparing the classification performance of ClipKIT (gappy, kpic and kpi modes) against BMGE and the raw MSA. The evaluation dataset is EggNOG v5.0, classified against functional and phylogenetic marker reference packages.
  • Remove the treesapp/sub_binaries/ directory and support for BMGE.jar
  • Add ClipKIT to requirements.txt and the conda recipe
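
The first task could take roughly the following shape. This is only a sketch, not TreeSAPP's actual implementation: the class name `ClipKitHelper`, its attributes and the output file naming are hypothetical, and it assumes the `clipkit` executable is on the PATH and accepts its documented `-m` (mode), `-o` (output) and `-g` (gaps threshold) options. The allowed modes are limited to the three named in the task list.

```python
import subprocess
from pathlib import Path


class ClipKitHelper:
    """Hypothetical wrapper that trims one FASTA alignment with ClipKIT."""

    # Only the modes named in this issue; ClipKIT itself offers more.
    MODES = ("gappy", "kpic", "kpi")

    def __init__(self, fasta_in, output_dir, mode="gappy",
                 gap_threshold=None, executable="clipkit"):
        if mode not in self.MODES:
            raise ValueError(f"Unsupported ClipKIT mode: {mode!r}")
        self.fasta_in = Path(fasta_in)
        self.mode = mode
        self.gap_threshold = gap_threshold
        self.executable = executable
        # Assumed naming scheme: <input stem>.<mode>.clipkit.fasta
        self.fasta_out = Path(output_dir) / (
            f"{self.fasta_in.stem}.{mode}.clipkit.fasta")

    def build_command(self):
        """Return the argument list without running anything."""
        cmd = [self.executable, str(self.fasta_in),
               "-m", self.mode, "-o", str(self.fasta_out)]
        if self.gap_threshold is not None:
            cmd += ["-g", str(self.gap_threshold)]
        return cmd

    def run(self):
        """Run ClipKIT and return the path to the trimmed alignment."""
        self.fasta_out.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(self.build_command(), check=True)
        return self.fasta_out
```

Keeping `build_command()` separate from `run()` lets the command line be inspected or unit-tested without ClipKIT installed, e.g. `ClipKitHelper("aln.fasta", "out", mode="kpic").build_command()`.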
@cmorganl cmorganl added the enhancement Highlight something that could be improved. Please be specific, TreeSAPP isn't perfect. label Jan 8, 2021
@cmorganl cmorganl self-assigned this Jan 8, 2021
@cmorganl cmorganl linked a pull request Jun 7, 2022 that will close this issue
@cmorganl
Collaborator Author

ClipKIT parameters and settings have been benchmarked using treesapp evaluate. The following code calculates a single error value for each run from the classifications across all taxonomic ranks, weighted by taxonomic distance (i.e. the number of ranks between a classification and the correct taxon):

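for f in *_evaluate*/final_outputs/clade_exclusion_performance.tsv
do
    # Print the file name so each score can be matched to its run
    echo "$f"
    # Sum the product of columns 5 and 7 over all rows
    awk '{sum+=$5*$7;} END {print sum;}' "$f"
done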

The parameter set with the lowest score will be used as the default.
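
For picking the lowest score automatically, the same calculation could be sketched in Python as below. This is an illustration rather than part of TreeSAPP: the function names are hypothetical, fields are split on whitespace as awk does by default, and non-numeric fields (such as a header row) are simply counted as 0, which only approximates awk's string-to-number conversion.

```python
from pathlib import Path

# Same glob as the shell loop above
PATTERN = "*_evaluate*/final_outputs/clade_exclusion_performance.tsv"


def _to_number(field):
    """Treat non-numeric fields as 0 (simplified awk behaviour)."""
    try:
        return float(field)
    except ValueError:
        return 0.0


def weighted_error(tsv_path, col_a=5, col_b=7):
    """Sum the product of two 1-based columns over all rows of a file."""
    total = 0.0
    with open(tsv_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < max(col_a, col_b):
                continue  # a missing field would contribute 0 in awk too
            total += _to_number(fields[col_a - 1]) * _to_number(fields[col_b - 1])
    return total


def best_parameter_set(root="."):
    """Return (path with the lowest score, {path: score}) under root."""
    scores = {str(p): weighted_error(p) for p in Path(root).glob(PATTERN)}
    if not scores:
        raise FileNotFoundError(f"No files matching {PATTERN} under {root}")
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    best, all_scores = best_parameter_set()
    for path, score in sorted(all_scores.items(), key=lambda kv: kv[1]):
        print(f"{score}\t{path}")
    print(f"Lowest score: {best}")
```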

@cmorganl cmorganl linked a pull request Oct 9, 2022 that will close this issue