Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting information content (IC) using Pfam2GO #196

Closed
dvklopfenstein opened this issue Dec 28, 2020 · 2 comments
Closed

Getting information content (IC) using Pfam2GO #196

dvklopfenstein opened this issue Dec 28, 2020 · 2 comments

Comments

@dvklopfenstein
Copy link
Collaborator

How do I get information content from a non-standard annotation file (file is not GAF or GPAD), like pfam2go

@dvklopfenstein
Copy link
Collaborator Author

dvklopfenstein commented Dec 28, 2020

The Pfam2GO file looks like this:

Pfam:PF00004 AAA > GO:ATP binding ; GO:0005524
Pfam:PF00004 AAA > GO:ATPase activity ; GO:0016887
Pfam:PF00005 ABC_tran > GO:ATP binding ; GO:0005524
Pfam:PF00006 ATP-synt_ab > GO:ATP binding ; GO:0005524
Pfam:PF00009 GTP_EFTU > GO:GTPase activity ; GO:0003924
Pfam:PF00009 GTP_EFTU > GO:GTP binding ; GO:0005525

The simplest method is to reformat it into a id2go annotation file, whose format looks like this:

AAR1  GO:0005575;GO:0003674;GO:0006970;GO:0006970;GO:0040029
AAR2  GO:0005575;GO:0003674;GO:0040029;GO:0009845
ACD5  GO:0005575;GO:0003674;GO:0008219
ACL1  GO:0005575;GO:0003674;GO:0009965;GO:0010073
ACL2  GO:0005575;GO:0003674;GO:0009826
ACL3  GO:0005575;GO:0003674;GO:0009826;GO:0009965
ACL4  GO:0005575;GO:0003674;GO:0009826
ACS1  GO:0005575;GO:0003674;GO:0009615

Using the code below, you can then get the information content (IC) of various GO IDs using data from the pfam2go.txt file:

#!/usr/bin/env python3
"""Get information content for GO IDs based on information in the pfam2go.txt"""

from os.path import exists
from collections import defaultdict
from tests.utils import get_godag
from goatools.anno.idtogos_reader import IdToGosReader
from goatools.semantic import TermCounts
from goatools.semantic import get_info_content


def main():
    """Reformat pfam2go.txt to an ID2GOs annotation format"""
    fin = 'pfam2go.txt'
    file_anno = 'pfam2go.anno'

    # Convert pfam2go file to an annotation format
    if not exists(file_anno):
        pfam2gos = _get_pfam2gos(fin)
        _wr_anno(file_anno, pfam2gos)

    # Prepare to calculate the information content (IC)
    godag = get_godag('go-basic.obo')
    annoobj = IdToGosReader(file_anno)
    tcntobj = TermCounts(godag, annoobj.get_id2gos())

    # Calculate the information content (IC) for a GO ID
    go_id = 'GO:0004930'
    info_content = get_info_content(go_id, tcntobj)
    print('IC={IC:.05f} {GO} {NAME}'.format(
        GO=go_id,
        IC=info_content,
        NAME=godag[go_id].name))


def _wr_anno(fout, pfam2gos):
    """Write pfam2gos dict to an annotation file"""
    with open(fout, 'w') as prt:
        for pfam, gos in pfam2gos.items():
            prt.write('{NAME}\t{GOs}\n'.format(
                NAME=pfam,
                GOs=';'.join(gos)))
        print('  {N} pfams WROTE: {ANNO}'.format(
            N=len(pfam2gos),
            ANNO=fout))

def _get_pfam2gos(fin):
    """Read pfam2go.txt. Store in a dict"""
    pfam2gos = defaultdict(set)
    with open(fin) as ifstrm:
        for line in ifstrm:
            if line[:1] != '!':
                line = line.strip()
                go_id = line[line.find('; GO:')+2:]
                ids = line[:line.find(' >')].split()
                pfam2gos[ids[1]].add(go_id)
    return dict(pfam2gos)


if __name__ == '__main__':
    main()

@dvklopfenstein
Copy link
Collaborator Author

The researcher confirms this method works, so closing the issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant