Add reference sequences for SARS-CoV-2 #549

Closed
7 tasks done
donkirkby opened this issue Mar 27, 2020 · 12 comments

Comments

@donkirkby
Member

donkirkby commented Mar 27, 2020

Support SARS-CoV-2 samples by adding the seed reference and other details.

  • Add seed reference.
  • Add coordinate references for gene regions, plus key positions if needed.
  • Add landmarks for genome coverage map, based on gene regions.
  • Handle duplicated base 13468.
  • Add conseq_all.csv with a minimum coverage of 1, and only MAX cutoff.
  • Compare consensus sequences with published results for the same samples.
  • Handle duplicated base when aligning deletions. Split out into "Align deletions after frame shift" #558.
@donkirkby donkirkby added this to the 7.13 milestone Mar 27, 2020
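
A minimal sketch of what "a minimum coverage of 1, and only MAX cutoff" could mean for a consensus sequence, assuming MAX picks the most common nucleotide at each position and reports ties as IUPAC mixtures. The function name, the input format, and the `x` placeholder for uncovered positions are illustrative assumptions, not MiCall's actual code.

```python
from collections import Counter

# Standard IUPAC ambiguity codes, used here when two or more bases tie.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def max_consensus(position_counts, min_coverage=1, low_coverage_char="x"):
    """Build a consensus from a list of per-position nucleotide counts.

    position_counts: one Counter (or dict) of base -> read count per position.
    Positions with no reads, or fewer reads than min_coverage, get
    low_coverage_char (a hypothetical placeholder).
    """
    consensus = []
    for counts in position_counts:
        coverage = sum(counts.values())
        if coverage == 0 or coverage < min_coverage:
            consensus.append(low_coverage_char)
            continue
        top = max(counts.values())
        winners = frozenset(base for base, n in counts.items() if n == top)
        if len(winners) == 1:
            consensus.append(next(iter(winners)))
        else:
            # Tie for the most common base: report a mixture.
            consensus.append(IUPAC.get(winners, "N"))
    return "".join(consensus)


example = [Counter("AAAC"), Counter("AG"), Counter(), Counter("T")]
print(max_consensus(example))
print(max_consensus(example, min_coverage=2))
```

With a minimum coverage of 1, a position covered by only two disagreeing reads is reported as a mixture, so low-coverage regions can show mixtures that deeper coverage would resolve.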
@ArtPoon
Contributor

ArtPoon commented Mar 27, 2020

I already did this for the BC CDC - want the JSON?

@donkirkby
Member Author

Thanks, Art. Much appreciated!

@donkirkby donkirkby changed the title from "Add reference sequences for COVID-19" to "Add reference sequences for SARS-CoV-2" Apr 3, 2020
donkirkby added a commit that referenced this issue Apr 3, 2020
Support FASTQ files from the Sequence Read Archive.
@donkirkby
Member Author

Related conversation at PoonLab#42.

@rhliang

rhliang commented Apr 8, 2020

Note to self: look at

  • line 827 of aln2counts.py (this is the get_consensus_rows method)
  • line 373 (where it calls self.write_consensus)

donkirkby added a commit that referenced this issue Apr 9, 2020
Add microtest sample for SARS.
Extend orf1a to orf1ab.
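
To illustrate the duplicated base from the task list: in the NCBI annotation of NC_045512.2, the orf1ab coding sequence is join(266..13468,13468..21555), so base 13468 is read twice because of the -1 ribosomal frameshift. The helper below is a hypothetical sketch of joining the two frames. Whether MiCall's seed reference uses the same coordinates is an assumption.

```python
def join_with_duplicated_base(genome, start, slip, end):
    """Return the coding sequence of a gene with a -1 frameshift.

    Coordinates are 1-based and inclusive. The base at `slip` is the last
    base read in the first frame, and it is read again as the first base
    of the second frame.
    """
    first_frame = genome[start - 1:slip]
    second_frame = genome[slip - 1:end]
    return first_frame + second_frame


# Toy genome: the base at position 6 appears twice in the joined result.
toy = "ATGAAACCCGGG"
print(join_with_duplicated_base(toy, start=1, slip=6, end=11))
```

With the full reference loaded as `ref`, `join_with_duplicated_base(ref, 266, 13468, 21555)` would return 21291 bases, which is a multiple of three, as a coding sequence should be.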
@donkirkby
Member Author

donkirkby commented Apr 10, 2020

List of samples to download and the toolkit to download with.

Find more samples from SRA by searching for "Severe acute respiratory syndrome-related coronavirus"[orgn:__txid694009]. You can filter by platform, and there are currently 466 Illumina records.
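
The same search could be scripted through the NCBI E-utilities `esearch` endpoint. This is a sketch that only builds the request URL. The function name and the `retmax` default are illustrative, and it does not attempt the platform filter mentioned above.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(term, retmax=20):
    """Build an esearch URL that queries the SRA database for `term`."""
    params = {"db": "sra", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


term = '"Severe acute respiratory syndrome-related coronavirus"[orgn:__txid694009]'
url = build_sra_search_url(term)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) returns matching record IDs. The record count will differ from the 466 quoted above as new runs are deposited.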

@donkirkby
Member Author

Comparing consensus sequences to published results for my first 12 samples:

| SRA accession | GISAID accession |
|---|---|
| SRR10903401 | |
| SRR10903402 | |
| SRR11092056 | |
| SRR11092057 | |
| SRR11092058 | |
| SRR11092064 | |
| SRR11140744 | |
| SRR11140746 | EPI_ISL_408670 |
| SRR11140748 | |
| SRR11140750 | |
| SRR11177792 | |
| SRR11314339 | |

@ArtPoon
Contributor

ArtPoon commented Apr 15, 2020

You can see our work in progress here (with SRR numbers matched to published accession numbers):
https://github.com/PoonLab/sam2conseq/wiki

I'd be really curious to see what you get!

@donkirkby
Member Author

Thanks, @ArtPoon, that will save me a bunch of time. I'll let you know what we find.

@donkirkby
Member Author

donkirkby commented Apr 21, 2020

Here's a summary of our results compared to the matches identified by @ArtPoon. (How did you find those, Art? Any suggestions on where to find EPI_ISL_408670? It doesn't show up when I search NCBI.)

| Run | Compared to | Differences |
|---|---|---|
| SRR10903401-SARS_S1 | MN988669.1 | Very good: 12 mismatches in the first 24 bases under low coverage, and 21 extra A's at the end out of 29881. |
| SRR10903402-SARS_S2 | MN988668.1 | Almost perfect: 21 extra A's at the end out of 29881. |
| SRR11092056-SARS_S3 | MN996530 | Bad: 899 mismatches, 17761 missing, and 217 added out of 29854. |
| SRR11092057-SARS_S4 | MN996528.1 | Very good: 4 mismatches, 33 missing, and 12 added out of 29891. Missing 14 at the start, a gap of 15 with no coverage at 5397, plus 4 single gaps of no coverage within 20 bases. The mismatches are all in low coverage; 3 are mixtures where coverage is 2. 12 extra A's at the end. |
| SRR11092058-SARS_S5 | MN996527.1 | Bad: lots of sections with no coverage. 38 mismatches, 7606 missing, and 26 added out of 29825. |
| SRR11092064-SARS_S6 | MN996531.1 | Bad: lots of sections with no coverage. 24 mismatches, 4667 missing, and 33 added out of 29857. |
| SRR11140744-SARS_S7 | EPI_ISL_408670 | Almost perfect: 28 missing from the start, and poly-A tail replaced with ACAGATATATACGCC out of 29879. |
| SRR11140746-SARS_S8 | EPI_ISL_408670 | Almost perfect: poly-A tail replaced with AATAWMAACAAACAGAGCCTAAAAAGGACAAAA4 out of 29879. |
| SRR11140748-SARS_S9 | EPI_ISL_408670 | Almost perfect: 6 missing from poly-A tail out of 29879. |
| SRR11140750-SARS_S10 | EPI_ISL_408670 | Almost perfect: 9 missing from the start, and poly-A tail replaced with ACAATTGCAACAATC out of 29879. |
| SRR11177792-SARS_S11 | MT072688 | Almost perfect: 57 added out of 29811. A few added to start, most added at end: AGTGCTGAG + poly-A tail. |
| SRR11314339-SARS_S12 | MT192765 | Almost perfect: 38 added out of 29829. A few added to start, most added at end: CCATGTGATTTTAATAG + poly-A tail. |
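
Counts like the mismatches, missing, and added figures above can be tallied from a pairwise alignment. This is a hedged sketch, not the script used for the table. It assumes the two sequences are already aligned to equal length with `-` for gaps, that "missing" means bases present only in the published sequence, and that "added" means bases present only in our consensus.

```python
def count_differences(aligned_ours, aligned_published, gap="-"):
    """Tally differences between two gap-aligned sequences of equal length."""
    if len(aligned_ours) != len(aligned_published):
        raise ValueError("Aligned sequences must have the same length.")
    mismatches = missing = added = 0
    for ours, theirs in zip(aligned_ours, aligned_published):
        if ours == gap and theirs == gap:
            continue  # Column with gaps on both sides carries no information.
        if ours == gap:
            missing += 1  # Published base that our consensus lacks.
        elif theirs == gap:
            added += 1  # Base in our consensus with no published counterpart.
        elif ours.upper() != theirs.upper():
            mismatches += 1
    return {"mismatches": mismatches, "missing": missing, "added": added}


print(count_differences("ACGT--AAA", "ACCTGG---"))
```

Mixture codes such as W or M count as mismatches here. A stricter comparison might treat a mixture as a match when it includes the published base.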

@donkirkby
Member Author

I forgot that I already had EPI_ISL_408670 downloaded. It came from GISAID. (Click on Browse, and search by accession id.)

@ArtPoon
Contributor

ArtPoon commented Apr 21, 2020

I queried the SRR number in the NCBI SRA database to get the sample description and then searched for a similar description in the GISAID annotations. Not perfect, I know.

@donkirkby
Member Author

Closing issue, now that basic reporting is working.
