Prune by distance from seed reference #290

Closed
donkirkby opened this issue Feb 19, 2016 · 1 comment
@donkirkby
Member

Add a step in remap that measures how far each consensus has moved from its seed reference, and rejects any seed whose consensus has moved more than some threshold, such as 5% of the reference length. If any seeds are rejected, run the mapping again with the remaining references.
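The proposed step could be sketched roughly as below. This is only an illustration under assumptions: `prune_seeds`, `map_reads`, and `mismatch_distance` are hypothetical names and not part of the existing remap code, `map_reads` stands in for whatever performs the mapping and returns one consensus per seed, and the 5% default is the example threshold from the description.

```python
def mismatch_distance(seed, consensus):
    """Count differing positions, plus any difference in length."""
    mismatches = sum(a != b for a, b in zip(seed, consensus))
    return mismatches + abs(len(seed) - len(consensus))


def prune_seeds(seeds, map_reads, distance=mismatch_distance, max_fraction=0.05):
    """Repeat the mapping until no consensus has drifted too far from its seed.

    seeds: dict of seed name -> seed reference sequence
    map_reads: hypothetical callable that takes a dict of seeds and returns
        a dict of seed name -> consensus sequence
    Returns (remaining seeds, their consensus sequences).
    """
    remaining = dict(seeds)
    while remaining:
        consensuses = map_reads(remaining)
        rejected = [
            name
            for name, seed in remaining.items()
            if name in consensuses
            and distance(seed, consensuses[name]) > max_fraction * len(seed)
        ]
        if not rejected:
            return remaining, consensuses
        # Drop the seeds that moved too far, then map again without them.
        for name in rejected:
            del remaining[name]
    return {}, {}
```

The distance function is passed in so that the Hamming-style count used here could be swapped for Levenshtein distance once a threshold has been chosen.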

Look at the Hamming or Levenshtein distance between the different references in each seed group to decide on a threshold.
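As a rough sketch of that comparison, the two distances and a pairwise table for one seed group might look like this. The function names and the `group` dictionary layout are assumptions made for illustration. Hamming distance is only defined for sequences of equal length, so it would need aligned references.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions that differ; both sequences must be the same length."""
    if len(a) != len(b):
        raise ValueError('Hamming distance needs sequences of equal length.')
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def pairwise_distances(group, distance=levenshtein):
    """Distance between every pair of references in one seed group.

    group: dict of reference name -> sequence
    """
    return {(name1, name2): distance(group[name1], group[name2])
            for name1, name2 in combinations(sorted(group), 2)}
```

This pure-Python Levenshtein takes time proportional to the product of the two lengths, so for genome-length references a compiled implementation would probably be more practical.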

As an example of a sample that moved too far from the seed reference, see samples 61673AWG1 and 61673AWG2 in the run from Mar 1 2016. The two samples are from the same extraction, but WG1 reports a mutation at NS3 155 and WG2 doesn't. We suspect that all the reads with the mutation mapped to genotype 1b and were ignored.

@donkirkby donkirkby added this to the near future milestone Feb 19, 2016
@donkirkby donkirkby changed the title Prune by Hamming distance Prune by distance from seed reference Mar 14, 2016
donkirkby added a commit that referenced this issue Mar 15, 2016
@donkirkby
Member Author

I tried comparing all the different references with a few different techniques. For example, this chart shows the comparison based on the Levenshtein distance to each reference. For each genotype, I calculate a median reference among all the references in that genotype. Then I display the distance from that median to every reference. Green dots are references within the same genotype, and red dots are references in other genotypes. Genotypes 1b, 5, and 7 only have one reference each, so that reference is its own median.

[Chart: Reference distances from median]
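The data behind a chart like this could be produced along the following lines. The comment does not say how the median reference was calculated, so this sketch assumes a medoid: the existing reference with the smallest total distance to the others in its genotype. All names here are hypothetical, and `distance` can be any function of two sequences, such as a Levenshtein implementation.

```python
def medoid(names, sequences, distance):
    """Pick the reference with the smallest total distance to the rest of its group.

    Assumption: this stands in for the "median reference"; the original
    calculation may have been different.
    """
    return min(sorted(names),
               key=lambda candidate: sum(distance(sequences[candidate],
                                                  sequences[other])
                                         for other in names))


def distances_from_medians(genotypes, sequences, distance):
    """Build one row per (genotype median, reference) pair.

    genotypes: dict of genotype -> list of reference names
    sequences: dict of reference name -> sequence
    Each row is (genotype, reference name, same genotype?, distance);
    same-genotype rows correspond to green dots, the others to red dots.
    """
    rows = []
    for genotype, names in sorted(genotypes.items()):
        centre = sequences[medoid(names, sequences, distance)]
        for other_genotype, other_names in sorted(genotypes.items()):
            for name in sorted(other_names):
                rows.append((genotype,
                             name,
                             other_genotype == genotype,
                             distance(centre, sequences[name])))
    return rows
```

A genotype with a single reference returns that reference as its own medoid, which matches the behaviour described for genotypes 1b, 5, and 7.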

Most of the references within each genotype are within a Levenshtein distance of about 1500 from their genotype's median, so I'm going to try using that as a limit. Genotype 6 may cause problems, so I'll review some samples to see how they look.

For comparison, here are some other techniques I used to compare the references:

[Chart: Reference distances from first]

[Chart: Reference distances from median by key regions]

[Chart: Reference distances from first by key regions]

All three show more overlap between same-genotype and other-genotype distances than the first chart does.
