LO - 3 - Identifying and dealing with data issues #27

KatherineCaley · 2023-10-17T22:59:26Z

inconsistent meta-data (data wrangling REFSOIL GenBank files)
- demonstrate using annotation_db
- explore using dotplots (e.g., 2 sequences that claim to be the same, dotplot to see if this is the case, ungapped smith waterman score -> p-value)
File formats issues
- Duchene phylip formats, solving using bad_phylip app (Gavin to upload this app)
- extremely long fasta sequence labels (e.g. making sure you can collate genomes from one species)

GavinHuttley · 2023-10-19T02:10:02Z

the bad phylip loader app

import re
from collections import defaultdict
from cogent3.app.composable import define_app, LOADER
from cogent3.app import typing as c3types
from cogent3 import make_aligned_seqs, open_data_store

_wspace = re.compile(r"\s+")

@define_app(app_type=LOADER)
def load_bad_phylip(path: c3types.IdentifierType, moltype="dna") -> c3types.AlignedSeqsType:
    data = path.read()  # interim
    data = data.splitlines()
    num_seqs, seq_len = [int(e) for e in data[0].split()]
    labels = []
    seqs = defaultdict(list)
    seq_num = 0
    getting_labels = True
    for line in data[1:]:
        line = line.strip()
        if not line:
            continue
        
        if getting_labels:
            label, line = _wspace.split(line, maxsplit=1)
            labels.append(label)
            if len(labels) == num_seqs:
                getting_labels = False
        else:
            label = labels[seq_num]
            seq_num += 1
            if seq_num == num_seqs:
                seq_num = 0

        seq = _wspace.sub("", line)
        seqs[label].append(seq)
    
    seqs = {l: "".join(s) for l, s in seqs.items()}
    lengths = {len(s) for s in seqs.values()}
    assert lengths == {seq_len}
    return make_aligned_seqs(data=seqs, moltype=moltype)

if __name__ == "__main__":    
    # works on Gavin's machine!
    dstore = open_data_store("~/Desktop/Inbox", suffix="phy", mode="r")
    loader = load_bad_phylip()
    # just doing one file
    print(repr(loader(dstore[0])))

and some sample data
locus-0.phy.zip

GavinHuttley · 2023-10-19T02:18:43Z

@fredjaya can you find examples of fasta files with overloaded sequence labels and show how to use the label_to_name argument to make_<un>aligned_seqs() or load_<un>aligned_seqs() functions docs are here.

Worth noting it's more difficult for a sequence from GenBank since the data is embedded within the features and doing a sequence rename can break the relationship to features in the annotation_db. (Discuss this with Kath.)

fredjaya · 2023-10-19T03:22:11Z

Will update this message as I come across new examples:

fredjaya · 2023-10-31T01:22:01Z

Example 1

Data

Using sequence EU244602.1
Downloaded to file data/EU244602.1.fa

Usage

Loading in an unaligned sequence, with and without removing all label characters after the first whitespace:

from cogent3 import load_unaligned_seqs

fasta_file = "data/EU244602.1.fa"
unstripped_seq = load_unaligned_seqs(fasta_file, moltype="dna")
stripped_seq = load_unaligned_seqs(
        fasta_file, moltype="dna", label_to_name=lambda x: x.split()[0])
)

print(unstripped_seq)
print(stripped_seq)

>EU244602.1 Cornechiniscus lobatus from Egypt cytochrome oxidase subunit I-like
(COI) gene, partial sequence; mitochondrial
GGTCAACAAATCATAAAGATATTGGCACTTTGTATTTTATTTTTGGGTTCTGATCTGCCT
CTGTCGGCTCTAGATTTAGACTTATTATTCGAACAGAATTGTCTCAACCTGGTATTTTTC
TTGGGGATGAGCATTTATATAATGTTTTGGTTACTTCCCACGCTTTAGTTATAATTTTCT
TTATAGTTATGCCAATTTTAATTGGAGGTTTTGGTAATTGATTAGTGCCCCTTATAATTG
GGGCTCCGGATATGGCGTTCCCTCGGATGAATAATTTGAGTTTTTGACTTTTACCACCTG
CTCTCATGCTACTTTTAATATCTTCTAATACTTATGCGGGGGCCGGAACCGGCTGGACTC
TTTATCCACCTCTATCAAGCTACTCAGGCCATTCTAATGTGACTGTTGATATAGCTATTT
TTTCTTTGCATATTGCCGGAGCTTCTTCTATTTTGGGTGCTATTAATTTTATTACTACTA
TTTTTAACATGCGCTCTTTATCTCTGAGTCTTGAAAATATAAGTTTATTTGTTTGATCTG
TTTTGATTACTGCCTTTTTACTTCTTTTTTCTCTACCTGTTTTAGCAGGTGGTATTACCA
TACTTTTAATAGATCGTAATTTTAATAGCTCATTTTTTGATCCTTCCGGGGGTGGTGATG
CCAATTCTTTTTCAGCANGAATTTTGATTTTTTGGTCACCCTGAAGTTTA

>EU244602.1
GGTCAACAAATCATAAAGATATTGGCACTTTGTATTTTATTTTTGGGTTCTGATCTGCCT
CTGTCGGCTCTAGATTTAGACTTATTATTCGAACAGAATTGTCTCAACCTGGTATTTTTC
TTGGGGATGAGCATTTATATAATGTTTTGGTTACTTCCCACGCTTTAGTTATAATTTTCT
TTATAGTTATGCCAATTTTAATTGGAGGTTTTGGTAATTGATTAGTGCCCCTTATAATTG
GGGCTCCGGATATGGCGTTCCCTCGGATGAATAATTTGAGTTTTTGACTTTTACCACCTG
CTCTCATGCTACTTTTAATATCTTCTAATACTTATGCGGGGGCCGGAACCGGCTGGACTC
TTTATCCACCTCTATCAAGCTACTCAGGCCATTCTAATGTGACTGTTGATATAGCTATTT
TTTCTTTGCATATTGCCGGAGCTTCTTCTATTTTGGGTGCTATTAATTTTATTACTACTA
TTTTTAACATGCGCTCTTTATCTCTGAGTCTTGAAAATATAAGTTTATTTGTTTGATCTG
TTTTGATTACTGCCTTTTTACTTCTTTTTTCTCTACCTGTTTTAGCAGGTGGTATTACCA
TACTTTTAATAGATCGTAATTTTAATAGCTCATTTTTTGATCCTTCCGGGGGTGGTGATG
CCAATTCTTTTTCAGCANGAATTTTGATTTTTTGGTCACCCTGAAGTTTA

Example 2

Data

Ensembl compara orthologues of BRCA2 in primates
Selected "Primates (23 species)
Humans and other primates"
"Download orthologues"
Download as .fasta

Usage

This time using load_aligned_seqs, and labels are delimited by a character instead of whitespace.

from cogent3 import load_aligned_seqs

fasta_file = "data/Human_BRCA2_orthologues.fa"

unstripped_seq = load_aligned_seqs(fasta_file, moltype="dna")
stripped_seq = load_aligned_seqs(
        fasta_file, moltype="dna", label_to_name=lamba x: x.split("_")[0]
)

print(unstripped_seq.names) # names only; sequences are long
print(stripped_seq.names)

['ENSCSAP00000013938_Csab/1-10260', 'ENSOGAP00000009477_Ogar/1-10218', 'ENSCATP00000015406_Caty/1-10257', 'ENSNLEP00000001277_Nleu/1-10257', 'ENSMNEP00000025034_Mnem/1-10278', 'ENSPCOP00000008219_Pcoq/1-10257', 'ENSPSMP00000026900_Psim/1-10317', 'ENSANAP00000015411_Anan/1-10212'
, 'ENSGGOP00000027226_Ggor/1-10248', 'ENSPTRP00000009812_Ptro/1-10254', 'ENSRROP00000030947_Rrox/1-10254', 'ENSMICP00000044704_Mmur/1-8847', 'ENSRBIP00000000297_Rbie/1-1158', 'ENSPPAP00000000836_Ppan/1-10251', 'ENSMLEP00000025266_Mleu/1-11025', 'ENSTSYP00000000441_Csyr/1-10173',
'ENSMMUP00000009432_Mmul/1-10131', 'ENSMFAP00000029418_Mfas/1-10290', 'ENSCJAP00000093240_Cjac/1-9963', 'ENSPPYP00000005997_Pabe/1-10245', 'ENSPANP00000022490_Panu/1-10260', 'ENSP00000369497_Hsap/1-10254', 'ENSCCAP00000031676_Cimi/1-10191']

['ENSCSAP00000013938', 'ENSOGAP00000009477', 'ENSCATP00000015406', 'ENSNLEP00000001277', 'ENSMNEP00000025034', 'ENSPCOP00000008219', 'ENSPSMP00000026900', 'ENSANAP00000015411', 'ENSGGOP00000027226', 'ENSPTRP00000009812', 'ENSRROP00000030947', 'ENSMICP00000044704', 'ENSRBIP00000000297', 'ENSPPAP00000000836', 'ENSMLEP00000025266', 'ENSTSYP00000000441', 'ENSMMUP00000009432', 'ENSMFAP00000029418', 'ENSCJAP00000093240', 'ENSPPYP00000005997', 'ENSPANP00000022490', 'ENSP00000369497', 'ENSCCAP00000031676']

KatherineCaley · 2023-11-12T23:20:37Z

File formats issues (bad phylip) is ported to issue #37

KatherineCaley · 2023-11-12T23:31:10Z

data wrangling is in #28

KatherineCaley · 2023-11-12T23:31:26Z

collapsing this issue into #28

KatherineCaley · 2023-11-13T00:14:54Z

Closing as contented related to this issue will be described in #40

KatherineCaley self-assigned this Oct 17, 2023

GavinHuttley added the learning outcome label Oct 18, 2023

GavinHuttley assigned fredjaya and unassigned KatherineCaley Oct 19, 2023

fredjaya assigned GavinHuttley and unassigned fredjaya Oct 31, 2023

GavinHuttley changed the title ~~LO - Identifying and dealing with data issues~~ LO - 3 - Identifying and dealing with data issues Nov 12, 2023

KatherineCaley added the block 3 label Nov 12, 2023

KatherineCaley assigned KatherineCaley and unassigned GavinHuttley Nov 12, 2023

KatherineCaley mentioned this issue Nov 13, 2023

LO - 3 - From orthologs to alignment #40

Closed

KatherineCaley closed this as completed Nov 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LO - 3 - Identifying and dealing with data issues #27

LO - 3 - Identifying and dealing with data issues #27

KatherineCaley commented Oct 17, 2023 •

edited by GavinHuttley

Loading

GavinHuttley commented Oct 19, 2023

GavinHuttley commented Oct 19, 2023

fredjaya commented Oct 19, 2023 •

edited

Loading

fredjaya commented Oct 31, 2023 •

edited

Loading

Data

Usage

Data

Usage

KatherineCaley commented Nov 12, 2023 •

edited

Loading

KatherineCaley commented Nov 12, 2023

KatherineCaley commented Nov 12, 2023

KatherineCaley commented Nov 13, 2023

LO - 3 - Identifying and dealing with data issues #27

LO - 3 - Identifying and dealing with data issues #27

Comments

KatherineCaley commented Oct 17, 2023 • edited by GavinHuttley Loading

GavinHuttley commented Oct 19, 2023

GavinHuttley commented Oct 19, 2023

fredjaya commented Oct 19, 2023 • edited Loading

fredjaya commented Oct 31, 2023 • edited Loading

Data

Usage

Data

Usage

KatherineCaley commented Nov 12, 2023 • edited Loading

KatherineCaley commented Nov 12, 2023

KatherineCaley commented Nov 12, 2023

KatherineCaley commented Nov 13, 2023

KatherineCaley commented Oct 17, 2023 •

edited by GavinHuttley

Loading

fredjaya commented Oct 19, 2023 •

edited

Loading

fredjaya commented Oct 31, 2023 •

edited

Loading

KatherineCaley commented Nov 12, 2023 •

edited

Loading