Unable to obtain raw data for reproduction #1

Open
d-cameron opened this issue Apr 19, 2020 · 3 comments

@d-cameron

Hello.

I just saw your preprint go up and wanted to generate ROC curves from the caller QUAL scores to compare against your single-point results, but the raw data does not appear to be available. Would you be able to update your repo and README with:

  • A link to the Google Drive location of the BAMs and/or FASTQs.
    • Some callers have significantly elevated FP rates when using bwa mem -a, and I am curious what effect your choice of aligner settings has on your results.
  • The raw VCF files as output by the callers.
    • https://github.com/Mangul-Lab-USC/benchmarking-sv-callers-paper/tree/master/Data/raw_data/mouse only includes your processed results, not the raw VCFs. I am unable to generate ROC curves because the essential information (i.e. QUAL and FILTER) was stripped when those subset files were generated.
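For readers unfamiliar with the two fields requested above: in a VCF data line, QUAL is the sixth tab-separated column and FILTER is the seventh. A minimal sketch of pulling them out with plain Python follows. The function name and the example records are hypothetical and are not taken from the paper's data.

```python
def parse_vcf_qual_filter(lines):
    """Return (chrom, pos, qual, filter) tuples for each VCF data line."""
    records = []
    for line in lines:
        # Skip blank lines and header lines ("##meta" and "#CHROM ...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        # A missing QUAL is written as "." in VCF.
        qual = None if fields[5] == "." else float(fields[5])
        filt = fields[6]
        records.append((chrom, pos, qual, filt))
    return records


# Made-up example records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\tsv1\tN\t<DEL>\t812.5\tPASS\tSVTYPE=DEL;END=1500",
    "chr1\t9000\tsv2\tN\t<DEL>\t.\tLOW_QUAL\tSVTYPE=DEL;END=9200",
]
records = parse_vcf_qual_filter(example)
```

Processed subset files that keep only position and length drop both columns, which is why the raw caller output is needed.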
@Addicted-to-coding
Member

Addicted-to-coding commented Apr 21, 2020

Hello,
We have uploaded the raw VCF files for GRIDSS here and we are working on uploading the files for other tools. We will also upload the original fastq and bam files to SRA/ENA.

We weren't able to find any clear documentation on how to use QUAL and FILTER, so those were ignored. We are happy to incorporate them if you could provide some instructions on how to use them.

@d-cameron
Author

I used QUAL to generate ROC curves, which are more informative than single points. For callers that do not report a QUAL score, I use the number of supporting reads as a proxy, as that's typically what's used as a cut-off. In your case, Figures 1ef and 3ef would benefit from QUAL lines. Similarly, using FILTER to split caller results into a FILTER=PASS subset and an all-calls subset is valuable.
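As a rough illustration of the idea, the sketch below sweeps a score threshold from high to low and records cumulative TP and FP counts at each distinct score, once for all calls and once for the FILTER=PASS subset. It assumes each call has already been labelled as TP or FP against the truth set. The function name, the tuple layout, and the example values are hypothetical.

```python
def roc_points(calls):
    """calls: iterable of (score, is_true_positive).

    Returns a list of (threshold, tp_count, fp_count), one entry per
    distinct score, counting every call with score >= threshold.
    """
    ordered = sorted(calls, key=lambda c: c[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, is_tp) in enumerate(ordered):
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last call sharing this score,
        # so tied scores collapse into a single threshold.
        last_of_tie = i + 1 == len(ordered) or ordered[i + 1][0] != score
        if last_of_tie:
            points.append((score, tp, fp))
    return points


# Made-up calls: (QUAL or supporting-read count, FILTER, is_tp)
calls = [
    (900.0, "PASS", True),
    (750.0, "PASS", True),
    (750.0, "LOW_QUAL", False),
    (300.0, "PASS", False),
    (120.0, "LOW_QUAL", True),
]
all_curve = roc_points((q, hit) for q, _, hit in calls)
pass_curve = roc_points((q, hit) for q, f, hit in calls if f == "PASS")
```

Plotting `tp_count` against `fp_count` (or precision against recall) for both lists gives one line per caller per subset instead of a single point.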

Your Fig 2 is also a bit strange. I would have thought Fig 2 should plot only TPs, with an x axis of 'length of called variant - true length of variant'. As it is, it appears to plot whether the caller makes more small or large deletion calls, not how accurately the caller reports the length of the variants that it does call.
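The suggested x axis can be sketched as follows, assuming each true-positive call has already been paired with its matching truth-set variant. The function name and the example lengths are hypothetical.

```python
import statistics


def length_errors(matched_calls):
    """matched_calls: iterable of (called_length, true_length) for TPs only.

    Returns called_length - true_length for each pair: positive values
    mean the caller over-estimated the variant length.
    """
    return [called - true for called, true in matched_calls]


# Made-up matched deletions, lengths in bp.
matched = [(510, 500), (980, 1000), (2000, 2000)]
errors = length_errors(matched)
median_error = statistics.median(errors)
```

A histogram of `errors` per caller then shows length accuracy directly, independent of whether the caller favours small or large deletions.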

Happy to unofficially review your preprint if you're interested in more comprehensive feedback.

Cheers
Daniel Cameron

@smangul1
Member

smangul1 commented Dec 2, 2020

Thanks, Daniel, for your feedback!

We intentionally plotted all SVs, not just TPs, to see how the distribution of inferred SVs differs from that of the true ones.
But we are happy to make a plot with just TPs; this will help us assess how accurately the tools can estimate the length of correctly detected SVs (probably with a 1000bp threshold).

Varuni,
can you please generate this plot?

In terms of QUAL, we will be happy to incorporate this in the future.

Serghei
