-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarification in relate command across genome builds #108
Comments
Also in this release, in the example, the docker tag is set as 0.2.1.6 instead of 0.2.16 |
Hi, the releases page indicates that the RNA sites file is different coordinates from the others: https://github.com/brentp/somalier/releases/tag/v0.2.16 I updated that last week and made it more clear, I think. I've now updated the docker tag. Thanks for reporting. |
Thanks for clarifying @brentp, for me it wasn't clear from the release page that just because the sites file had a different number of sites, fingerprints generated with this sites file wouldn't be compatible with the rest of the references. Would it be possible to have a hg19 RNA sites file? I can create a separate issue for this request. Your input would be very much appreciated! Alexis |
It is a bit of a pain to adjust the few troublesome SNPs (12 at this case) but here it is sites.chm13v2.rna.vcf.gz The troublesome sites (deleted in T2T assembly) are below, placed on (hopefully) correct side and near the closest correctly liftedover SNP. The chm13v2 coordinates are on columns 1 and 2.
Here's somalier relate I got for one sample of Nanopore WGS with GRCh38 and chm13v2 sites (same data, different basecall and reference).
|
thank you! @alexiswl does this work for you? if so I'll add to the sites in the releases. |
@alexiswl , I think it's best to re-extract with a pair of matching sites. |
@brentp would a slightly out of order vcf work? For the 24 SNPs that couldn't be mapped, some are in locations that are in region lower / or higher than the next snp. i.e merging the liftover and rejected we get
Where 147905175 and 149818561 are references in the hg19 reference and from hg38 positions 148433037, and 149846994 respectively. 149029906 is a hg38 position in the rejected vcf. I generated a 200 bp fastq for each of the rejected variants, re aligned them to both hg38 and hg19 to find their corresponding mapping positions and for 149029906 this became 144854487 (on the complement strand of the hg19 reference) - see screenshot below. So a part of the genome that has been flipped between the hg19 and hg38 reference If I update this new vcf as
The new vcfs technically now out of position order 147905175, 144854487, 149818561 but is of the same order as the original hg38 (necessary since the somalier binary does not store positional information as far as I know). Would this cause an error when running the somalier extract command? Or is this a valid workaround? I'd rather not have to re-extract the sites from both our WGS and TSO data Alexis |
Hi Brent, I've built a hg19 rna sites file (that's slightly out of order, necessary to maintain positional memory from the hg38 sites file). I've managed to test concordance across genome builds, with bam files from the same subject aligned to hg38 and hg19 references and it works great! Attached is the notebook (in zip) used to generate the file AND the output sites file itself. I was able to find 15 of these variants easily through vcf -> bed -> fastq remap -> manual IGV assessment. For six of the remaining 9 variants, I was able to approximate the target by determining the expected distance between the adjacent successful targets and used a position in the hg19 WGS sites file that resided mostly closely to this expected position. For the remaining three variants, the expected position of these on hg19 was N. I was able to leave these as 'N' without consequence. |
In the v0.2.16 release it states
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
When running
I get
So assume this cross-reference does not apply for somalier files generated with the hg38 rna sites file?
The text was updated successfully, but these errors were encountered: