Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LO - 3 - Identifying and dealing with data issues #27

Closed
KatherineCaley opened this issue Oct 17, 2023 · 8 comments
Closed

LO - 3 - Identifying and dealing with data issues #27

KatherineCaley opened this issue Oct 17, 2023 · 8 comments

Comments

@KatherineCaley
Copy link

KatherineCaley commented Oct 17, 2023

  • inconsistent meta-data (data wrangling REFSOIL GenBank files)
    • demonstrate using annotation_db
    • explore using dotplots (e.g., 2 sequences that claim to be the same, dotplot to see if this is the case, ungapped smith waterman score -> p-value)
  • File formats issues
    • Duchene phylip formats, solving using bad_phylip app (Gavin to upload this app)
    • extremely long fasta sequence labels (e.g. making sure you can collate genomes from one species)
@GavinHuttley
Copy link
Contributor

the bad phylip loader app

import re
from collections import defaultdict
from cogent3.app.composable import define_app, LOADER
from cogent3.app import typing as c3types
from cogent3 import make_aligned_seqs, open_data_store

_wspace = re.compile(r"\s+")

@define_app(app_type=LOADER)
def load_bad_phylip(path: c3types.IdentifierType, moltype="dna") -> c3types.AlignedSeqsType:
    data = path.read()  # interim
    data = data.splitlines()
    num_seqs, seq_len = [int(e) for e in data[0].split()]
    labels = []
    seqs = defaultdict(list)
    seq_num = 0
    getting_labels = True
    for line in data[1:]:
        line = line.strip()
        if not line:
            continue
        
        if getting_labels:
            label, line = _wspace.split(line, maxsplit=1)
            labels.append(label)
            if len(labels) == num_seqs:
                getting_labels = False
        else:
            label = labels[seq_num]
            seq_num += 1
            if seq_num == num_seqs:
                seq_num = 0

        seq = _wspace.sub("", line)
        seqs[label].append(seq)
    
    seqs = {l: "".join(s) for l, s in seqs.items()}
    lengths = {len(s) for s in seqs.values()}
    assert lengths == {seq_len}
    return make_aligned_seqs(data=seqs, moltype=moltype)

if __name__ == "__main__":    
    # works on Gavin's machine!
    dstore = open_data_store("~/Desktop/Inbox", suffix="phy", mode="r")
    loader = load_bad_phylip()
    # just doing one file
    print(repr(loader(dstore[0])))

and some sample data
locus-0.phy.zip

@GavinHuttley
Copy link
Contributor

@fredjaya can you find examples of fasta files with overloaded sequence labels and show how to use the label_to_name argument to make_<un>aligned_seqs() or load_<un>aligned_seqs() functions docs are here.

Worth noting it's more difficult for a sequence from GenBank since the data is embedded within the features and doing a sequence rename can break the relationship to features in the annotation_db. (Discuss this with Kath.)

@fredjaya
Copy link

fredjaya commented Oct 31, 2023

Example 1

Data

  • Using sequence EU244602.1
  • Downloaded to file data/EU244602.1.fa

Usage

Loading in an unaligned sequence, with and without removing all label characters after the first whitespace:

from cogent3 import load_unaligned_seqs

fasta_file = "data/EU244602.1.fa"
unstripped_seq = load_unaligned_seqs(fasta_file, moltype="dna")
stripped_seq = load_unaligned_seqs(
        fasta_file, moltype="dna", label_to_name=lambda x: x.split()[0])
)

print(unstripped_seq)
print(stripped_seq)

>EU244602.1 Cornechiniscus lobatus from Egypt cytochrome oxidase subunit I-like
(COI) gene, partial sequence; mitochondrial
GGTCAACAAATCATAAAGATATTGGCACTTTGTATTTTATTTTTGGGTTCTGATCTGCCT
CTGTCGGCTCTAGATTTAGACTTATTATTCGAACAGAATTGTCTCAACCTGGTATTTTTC
TTGGGGATGAGCATTTATATAATGTTTTGGTTACTTCCCACGCTTTAGTTATAATTTTCT
TTATAGTTATGCCAATTTTAATTGGAGGTTTTGGTAATTGATTAGTGCCCCTTATAATTG
GGGCTCCGGATATGGCGTTCCCTCGGATGAATAATTTGAGTTTTTGACTTTTACCACCTG
CTCTCATGCTACTTTTAATATCTTCTAATACTTATGCGGGGGCCGGAACCGGCTGGACTC
TTTATCCACCTCTATCAAGCTACTCAGGCCATTCTAATGTGACTGTTGATATAGCTATTT
TTTCTTTGCATATTGCCGGAGCTTCTTCTATTTTGGGTGCTATTAATTTTATTACTACTA
TTTTTAACATGCGCTCTTTATCTCTGAGTCTTGAAAATATAAGTTTATTTGTTTGATCTG
TTTTGATTACTGCCTTTTTACTTCTTTTTTCTCTACCTGTTTTAGCAGGTGGTATTACCA
TACTTTTAATAGATCGTAATTTTAATAGCTCATTTTTTGATCCTTCCGGGGGTGGTGATG
CCAATTCTTTTTCAGCANGAATTTTGATTTTTTGGTCACCCTGAAGTTTA

>EU244602.1
GGTCAACAAATCATAAAGATATTGGCACTTTGTATTTTATTTTTGGGTTCTGATCTGCCT
CTGTCGGCTCTAGATTTAGACTTATTATTCGAACAGAATTGTCTCAACCTGGTATTTTTC
TTGGGGATGAGCATTTATATAATGTTTTGGTTACTTCCCACGCTTTAGTTATAATTTTCT
TTATAGTTATGCCAATTTTAATTGGAGGTTTTGGTAATTGATTAGTGCCCCTTATAATTG
GGGCTCCGGATATGGCGTTCCCTCGGATGAATAATTTGAGTTTTTGACTTTTACCACCTG
CTCTCATGCTACTTTTAATATCTTCTAATACTTATGCGGGGGCCGGAACCGGCTGGACTC
TTTATCCACCTCTATCAAGCTACTCAGGCCATTCTAATGTGACTGTTGATATAGCTATTT
TTTCTTTGCATATTGCCGGAGCTTCTTCTATTTTGGGTGCTATTAATTTTATTACTACTA
TTTTTAACATGCGCTCTTTATCTCTGAGTCTTGAAAATATAAGTTTATTTGTTTGATCTG
TTTTGATTACTGCCTTTTTACTTCTTTTTTCTCTACCTGTTTTAGCAGGTGGTATTACCA
TACTTTTAATAGATCGTAATTTTAATAGCTCATTTTTTGATCCTTCCGGGGGTGGTGATG
CCAATTCTTTTTCAGCANGAATTTTGATTTTTTGGTCACCCTGAAGTTTA

Example 2

Data

Usage

This time using load_aligned_seqs, and labels are delimited by a character instead of whitespace.

from cogent3 import load_aligned_seqs

fasta_file = "data/Human_BRCA2_orthologues.fa"

unstripped_seq = load_aligned_seqs(fasta_file, moltype="dna")
stripped_seq = load_aligned_seqs(
        fasta_file, moltype="dna", label_to_name=lamba x: x.split("_")[0]
)

print(unstripped_seq.names) # names only; sequences are long
print(stripped_seq.names)

['ENSCSAP00000013938_Csab/1-10260', 'ENSOGAP00000009477_Ogar/1-10218', 'ENSCATP00000015406_Caty/1-10257', 'ENSNLEP00000001277_Nleu/1-10257', 'ENSMNEP00000025034_Mnem/1-10278', 'ENSPCOP00000008219_Pcoq/1-10257', 'ENSPSMP00000026900_Psim/1-10317', 'ENSANAP00000015411_Anan/1-10212'
, 'ENSGGOP00000027226_Ggor/1-10248', 'ENSPTRP00000009812_Ptro/1-10254', 'ENSRROP00000030947_Rrox/1-10254', 'ENSMICP00000044704_Mmur/1-8847', 'ENSRBIP00000000297_Rbie/1-1158', 'ENSPPAP00000000836_Ppan/1-10251', 'ENSMLEP00000025266_Mleu/1-11025', 'ENSTSYP00000000441_Csyr/1-10173',
'ENSMMUP00000009432_Mmul/1-10131', 'ENSMFAP00000029418_Mfas/1-10290', 'ENSCJAP00000093240_Cjac/1-9963', 'ENSPPYP00000005997_Pabe/1-10245', 'ENSPANP00000022490_Panu/1-10260', 'ENSP00000369497_Hsap/1-10254', 'ENSCCAP00000031676_Cimi/1-10191']

['ENSCSAP00000013938', 'ENSOGAP00000009477', 'ENSCATP00000015406', 'ENSNLEP00000001277', 'ENSMNEP00000025034', 'ENSPCOP00000008219', 'ENSPSMP00000026900', 'ENSANAP00000015411', 'ENSGGOP00000027226', 'ENSPTRP00000009812', 'ENSRROP00000030947', 'ENSMICP00000044704', 'ENSRBIP00000000297', 'ENSPPAP00000000836', 'ENSMLEP00000025266', 'ENSTSYP00000000441', 'ENSMMUP00000009432', 'ENSMFAP00000029418', 'ENSCJAP00000093240', 'ENSPPYP00000005997', 'ENSPANP00000022490', 'ENSP00000369497', 'ENSCCAP00000031676']

@fredjaya fredjaya assigned GavinHuttley and unassigned fredjaya Oct 31, 2023
@GavinHuttley GavinHuttley changed the title LO - Identifying and dealing with data issues LO - 3 - Identifying and dealing with data issues Nov 12, 2023
@KatherineCaley
Copy link
Author

KatherineCaley commented Nov 12, 2023

File formats issues (bad phylip) is ported to issue #37

@KatherineCaley
Copy link
Author

data wrangling is in #28

@KatherineCaley
Copy link
Author

collapsing this issue into #28

@KatherineCaley
Copy link
Author

Closing as contented related to this issue will be described in #40

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants