Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a function to inspect an ALT allele of VCF #164

Merged
merged 5 commits into from
May 24, 2019
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 90 additions & 0 deletions src/cljam/io/vcf/util.clj
Original file line number Diff line number Diff line change
Expand Up @@ -286,3 +286,93 @@
(if before? \[ \]))]
(str (when-not before? s) bracket chr \: pos bracket (when before? s)))
(str (when (= :before join) \.) s (when (= :after join) \.)))))

(defn- inspect-nucleotides-allele
[ref alt]
(let [ref-length (count ref)
alt-length (count alt)
upper-ref (cstr/upper-case ref)
upper-alt (cstr/upper-case alt)
left-match (->> (map = upper-ref upper-alt)
(take-while true?)
count)
right-match (->> (cstr/reverse upper-alt)
(map = (cstr/reverse upper-ref))
(take-while true?)
count
long
(Math/min (- (Math/min ref-length alt-length)
left-match)))
matched-length (+ left-match right-match)]
;; Note that `{:type :ref}` is handled in the caller
(cond
(and (= left-match ref-length) (< left-match alt-length))
{:type :insertion, :offset (dec left-match),
:n-bases (- alt-length left-match), :inserted (subs alt left-match)}

(and (< left-match ref-length) (= left-match alt-length))
{:type :deletion, :offset (dec left-match),
:n-bases (- left-match ref-length), :deleted (subs ref left-match)}
r6eve marked this conversation as resolved.
Show resolved Hide resolved

(= (inc matched-length) ref-length alt-length)
{:type :snv, :ref (nth ref left-match),
:alt (nth alt left-match), :offset left-match}

(= matched-length alt-length)
{:type :deletion, :offset (dec left-match),
:n-bases (- left-match (- ref-length right-match)),
:deleted (subs ref left-match (- ref-length right-match))}

(= matched-length ref-length)
{:type :insertion, :offset (dec left-match),
:n-bases (- alt-length right-match left-match),
:inserted (subs alt left-match (- alt-length right-match))}

(= ref-length alt-length)
{:type :mnv, :offset left-match,
:ref (subs ref left-match (- ref-length right-match)),
:alt (subs alt left-match (- alt-length right-match))}

:else {:type :complex})))

(defn inspect-allele
"Inspects an `alt` allele by comparing to a `ref` allele string.
Returns a map containing `:type` and other detailed information.
A value of the key `:type` can be one of the followings:
- `:ref` Non-variant allele
- `:id` Symbolic reference
- `:snv` Single nucleotide variant
- `:mnv` Multiple nucleotide variants
- `:insertion` Insertion of a short base sequence
- `:deletion` Deletion of a short base sequence
- `:complete-insertion` Complete insertion of a long sequence
- `:breakend` Breakend of a complex rearrangement
- `:complex` Complex nucleotide variants other than snv/mnv/indel
- `:other` Can't categorize the allele, might be malformed"
[ref alt]
(let [alt-length (count alt)]
(cond
(or (#{"." "*" "X" "<*>" "<X>"} alt)
(= (cstr/upper-case ref) (cstr/upper-case alt)))
{:type :ref}

(re-matches #"<.+>" alt)
{:type :id, :id (subs alt 1 (dec alt-length))}

(re-matches #"(?i)([ACGTN]<.+>|<.+>[ACGTN])" alt)
(if (cstr/starts-with? alt "<")
{:type :complete-insertion,
:join :before,
:base (last alt),
:id (subs alt 1 (- alt-length 2))}
{:type :complete-insertion,
:join :after,
:base (first alt),
:id (subs alt 2 (dec alt-length))})
r6eve marked this conversation as resolved.
Show resolved Hide resolved

:else (or (some-> (parse-breakend alt)
(assoc :type :breakend))
(if (and (re-matches #"(?i)[ACGTN]+" ref)
(re-matches #"(?i)[ACGTN]+" alt))
(inspect-nucleotides-allele ref alt)
{:type :other})))))
72 changes: 72 additions & 0 deletions test/cljam/io/vcf/util_test.clj
Original file line number Diff line number Diff line change
Expand Up @@ -336,3 +336,75 @@
".A" {:bases "A", :join :before, :chr "1", :pos 1}
".A" {:bases "A", :join :before, :chr "1", :strand :forward}
".A" {:bases "A", :join :before, :pos 1, :strand :forward}))

(deftest inspect-allele
(are [?ref ?alt ?expected]
(= ?expected (vcf-util/inspect-allele ?ref ?alt))

"A" "." {:type :ref}
"A" "*" {:type :ref}
"A" "X" {:type :ref}
"A" "A" {:type :ref}
"A" "a" {:type :ref}
"AA" "AA" {:type :ref}
"AA" "Aa" {:type :ref}
"A" "<*>" {:type :ref} ;; conventional
"A" "<X>" {:type :ref} ;; conventional

"A" "<DUP>" {:type :id, :id "DUP"}
"A" "<INV>" {:type :id, :id "INV"}
"A" "<NON_REF>" {:type :id, :id "NON_REF"} ;; not :ref

"A" "T" {:type :snv, :ref \A, :alt \T, :offset 0}
"TC" "TG" {:type :snv, :ref \C, :alt \G, :offset 1}
"TC" "GC" {:type :snv, :ref \T, :alt \G, :offset 0}
"TCA" "TAA" {:type :snv, :ref \C, :alt \A, :offset 1}
"TCAG" "tcAC" {:type :snv, :ref \G, :alt \C, :offset 3}
"TGTAT" "TGcAT" {:type :snv, :ref \T, :alt \c, :offset 2}

"TG" "Gc" {:type :mnv, :ref "TG", :alt "Gc", :offset 0}
"TCG" "tGA" {:type :mnv, :ref "CG", :alt "GA", :offset 1}
"TGCA" "TCCC" {:type :mnv, :ref "GCA", :alt "CCC", :offset 1} ;; two snvs?
"TGGTA" "TCcAA" {:type :mnv, :ref "GGT", :alt "CcA", :offset 1}

"TC" "T" {:type :deletion, :n-bases -1, :offset 0, :deleted "C"}
"TTC" "TC" {:type :deletion, :n-bases -1, :offset 0, :deleted "T"}
"GTC" "G" {:type :deletion, :n-bases -2, :offset 0, :deleted "TC"}
"TCG" "TG" {:type :deletion, :n-bases -1, :offset 0, :deleted "C"}
"TGCA" "TGC" {:type :deletion, :n-bases -1, :offset 2, :deleted "A"}
;; ambiguous
"TCGCG" "TCG" {:type :deletion, :n-bases -2, :offset 2, :deleted "CG"}
;; ambiguous
"TAAACCCTAAA" "TAA" {:type :deletion, :n-bases -8,
:offset 2, :deleted "ACCCTAAA"}

"C" "CTAG" {:type :insertion, :n-bases 3, :offset 0, :inserted "TAG"}
"TC" "TTC" {:type :insertion, :n-bases 1, :offset 0, :inserted "T"}
"GTC" "GTCT" {:type :insertion, :n-bases 1, :offset 2, :inserted "T"}
"TCG" "TCAG" {:type :insertion, :n-bases 1, :offset 1, :inserted "A"}
;; ambiguous
"GCG" "GCGCG" {:type :insertion, :n-bases 2, :offset 2, :inserted "CG"}
;; ambiguous
"ATTTTTTTTTTTTT" "ATTTTTTTTTTTTTTT" {:type :insertion, :n-bases 2,
:offset 13, :inserted "TT"}

"T" "<ctg1>C" {:type :complete-insertion, :join :before,
:base \C, :id "ctg1"}
"T" "c<ctg1>" {:type :complete-insertion, :join :after,
:base \c, :id "ctg1"}

"G" "G." {:type :breakend, :join :after, :bases "G"}
"A" ".A" {:type :breakend, :join :before, :bases "A"}
"T" "TCC." {:type :breakend, :join :after, :bases "TCC"}
"A" ".TGCA" {:type :breakend, :join :before, :bases "TGCA"}
"T" "]13:123456]T" {:type :breakend, :join :before,
:strand :forward, :chr "13", :pos 123456, :bases "T"}

"T" "<ctg1>CT" {:type :other} ;; invalid
"T" "<CT" {:type :other} ;; invalid
"." "A" {:type :other}

"TAC" "GC" {:type :complex}
"TG" "TAC" {:type :complex}
"TCA" "GGGG" {:type :complex}
"AAAA" "CAC" {:type :complex}))