Assembler benchmarking #72

Closed
rcedgar opened this issue May 3, 2020 · 7 comments
Labels: Bioinformatics (Bioinformatics task), enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@rcedgar
Collaborator

rcedgar commented May 3, 2020

See notes on how to benchmark assemblers here:

200503_rce_assembler_benchmark_notes.pdf

Anyone up for taking on this task?

@rcedgar rcedgar added the Bioinformatics, enhancement, and good first issue labels May 3, 2020
@ababaian
Owner

ababaian commented May 3, 2020

@JustinChu / @taltman this will be right up your alley for measuring how well we can assemble new CoV.

@rcedgar rcedgar changed the title from "Assembler benchmark testing -- volunteer needed" to "Assembler benchmark testing" May 4, 2020
@JustinChu
Collaborator

JustinChu commented May 6, 2020

Hi @rcedgar
Can you point me to the pangenomes/datasets that you created for the alignment experiments?

The evaluation protocol I was thinking of is as follows:

  • Input: library with known SARS or hCov-19 (as in the PDF)
  • Pangenome (without a specific strain, e.g. SARS or hCov-19)
  • Align reads from a COV+ dataset of the removed strain, using the same protocol as the standard pipeline. Use only the reads that map to generate contigs (test multiple methods)
  • Evaluate contigs (completeness, contiguity, mis-assemblies, etc.) against the SARS or hCov-19 reference.
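
To make the last step concrete, here is a minimal Python sketch of two of the metrics named above: contiguity as N50, and completeness as the fraction of the reference covered by aligned contigs. The function names `n50` and `genome_fraction` and the half-open `(start, end)` interval convention are assumptions made for illustration; they are not taken from the benchmark notes or from any existing tool.

```python
from typing import Iterable, List, Tuple


def n50(contig_lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def genome_fraction(aligned: List[Tuple[int, int]], reference_length: int) -> float:
    """Fraction of the reference covered by at least one aligned contig.

    `aligned` holds half-open (start, end) reference coordinates (an assumed
    convention); overlapping intervals are merged so bases are counted once.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(aligned):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / reference_length
```

For example, `n50([100, 200, 300, 400])` gives 300, and `genome_fraction([(0, 50), (40, 100), (150, 200)], 200)` gives 0.75. Mis-assembly detection needs real contig-to-reference alignments and is not covered by this sketch.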

I'll need a pangenome with the strain being tested removed (is up to 80% what we have tested?), the reference sequence of that strain, and maybe libraries positive for the strain (I may be able to simply simulate data instead).

Maybe you could clarify what the folders in your s3 bucket contain so I can perhaps reuse them. For instance, what do the fasta files in the /r or /q directories contain?
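
If simulated data is used, a very simple starting point could look like the Python sketch below. It samples error-free, fixed-length reads uniformly from a reference string. The function name `simulate_reads` and its parameters are hypothetical, and the sketch ignores sequencing errors, paired ends, strand, and coverage bias, so a dedicated read simulator would be more realistic.

```python
import random
from typing import List


def simulate_reads(reference: str, read_length: int, coverage: float, seed: int = 0) -> List[str]:
    """Sample error-free reads uniformly from `reference` (illustrative only)."""
    if read_length <= 0 or read_length > len(reference):
        raise ValueError("read_length must be between 1 and len(reference)")
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    # Number of reads needed to reach the requested mean coverage.
    n_reads = int(coverage * len(reference) / read_length)
    last_start = len(reference) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(reference[start:start + read_length])
    return reads
```

Calling `simulate_reads(ref, read_length=50, coverage=10)` on a 200-base reference returns 40 reads of 50 bases each.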

@rcedgar
Collaborator Author

rcedgar commented May 6, 2020

"I'll need a pangenome with the strain being tested removed (is up to 80% what we have tested?)" -- yes, exactly! See the benchmark notes here, which explain the s3 files:

200430_covx_benchmark_howto.pdf

@JustinChu
Collaborator

Ah, that is what I was looking for, thanks!

@taltman
Collaborator

taltman commented May 15, 2020

I think a lot of what we want to do here can be done using MetaQUAST:
http://quast.sourceforge.net/metaquast

One thing it handles poorly is aligning the short reads back to the assemblies: it takes forever. That is my recollection from a large metagenomics assembly of a human gut sample. Perhaps that is an optimization that @rcedgar would be best positioned to tackle? We'll generate some data on how it runs with our filtered reads, and will address performance if it is still an issue.

@taltman taltman changed the title from "Assembler benchmark testing" to "Assembler benchmarking" May 15, 2020
@taltman taltman added this to the Assembly: Validation milestone May 15, 2020
@rcedgar
Collaborator Author

rcedgar commented May 15, 2020

This is my best attempt at writing a very fast read mapper:

https://drive5.com/urmap/manual/downloads.html

@ababaian
Owner

Closed by #130
