PyFuncover : Full proteome search for a specific function using BLAST and PFAM.
Python Function uncover (PyFuncover) is a new bioinformatics tool that searches a full proteome for proteins with a specific function. The pipeline, written in Python, uses BLAST alignments with the sequences of a PFAM family as search seeds. The methodology is based on Ada-BLAST, which is no longer available. We tested PyFuncover using the FABP family Lipocalin_7 from PFAM (version 32, 2019) against the Homo sapiens NCBI proteome. After the scoring function was applied to all BLAST results, the data were classified and submitted to a GO-term analysis using bioDBnet. The analysis showed that all FABP family members ranked among the proteins at 900 and above. Above this threshold we also found families that bind hydrophobic molecules similar to fatty acids, such as the retinol acid transporter and the cellular retinoic acid-binding protein.
- Windows
- Unix
- Python 2.7+
- Numpy
- Pandas
- BioPython
- Matplotlib
- NCBI-BLAST+
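Before a first run, it can help to confirm that the requirements listed above are installed. The snippet below is a minimal sketch and is not part of PyFuncover: `check_requirements` is a hypothetical helper, it assumes Python 3 (for `importlib.util.find_spec`), and it assumes that NCBI-BLAST+ is installed when a `blastp` executable is found on the `PATH`.

```python
import importlib.util
import shutil

# Import names for the Python packages listed above
# (BioPython is imported as "Bio").
PYTHON_PACKAGES = {
    "Numpy": "numpy",
    "Pandas": "pandas",
    "BioPython": "Bio",
    "Matplotlib": "matplotlib",
}


def check_requirements(executable="blastp"):
    """Return a dict mapping each requirement name to True (found) or False."""
    status = {}
    for label, module in PYTHON_PACKAGES.items():
        status[label] = importlib.util.find_spec(module) is not None
    # NCBI-BLAST+ is an external program, so look for one of its
    # executables on the PATH instead of importing anything.
    status["NCBI-BLAST+"] = shutil.which(executable) is not None
    return status


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print("{:<12} {}".format(name, "OK" if found else "MISSING"))
```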
python PyFuncover.py --update
-pfam : List of PFAM family IDs (PF####), separated by spaces
PyFuncover.py -pfam PF14651 PF#### ...
-taxid : List of TaxIDs of the organisms whose proteomes you want to download, separated by spaces (for example, the human and yeast TaxIDs)
PyFuncover.py -taxid 9606 559492
A TaxID can also represent a node in the phylogenetic tree (Eukaryotes: 2759; Insecta: 50557, ...). In that case, all available assemblies under that node are retrieved.
--update : Download the latest release of the NCBI Taxonomy database and the latest RefSeq, prokaryote, and eukaryote genome assembly lists
PyFuncover.py --update
--out : Output filename. The output is written in CSV format (pandas.to_csv output). Default: result.csv
--nb-blast : Number of parallel BLAST processes (default: 10). Be careful: a high number uses a lot of memory and can cause a stack overflow!
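To illustrate what a cap such as --nb-blast means in practice, here is a minimal sketch of running external BLAST jobs with a bounded number of simultaneous processes. It is not PyFuncover's actual implementation: `build_blastp_command` and `run_bounded` are hypothetical helpers, the file and database names are placeholders, and a thread pool is only one of several ways to limit concurrency. The `blastp` options used (`-query`, `-db`, `-outfmt 6`, `-out`) are standard NCBI-BLAST+ options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_blastp_command(query_fasta, database, out_file):
    """Build a blastp command line with tabular output (-outfmt 6)."""
    return ["blastp", "-query", query_fasta, "-db", database,
            "-outfmt", "6", "-out", out_file]


def run_bounded(commands, nb_blast=10, runner=subprocess.call):
    """Run commands with at most nb_blast of them active at once.

    Returns the runner results (exit codes with the default runner)
    in the same order as the commands.
    """
    # Each worker thread blocks on one external process, so the pool
    # size caps the number of simultaneous BLAST processes.
    with ThreadPoolExecutor(max_workers=nb_blast) as pool:
        return list(pool.map(runner, commands))


# Example usage (requires blastp and a formatted database; names are placeholders):
# commands = [build_blastp_command(q, "proteome_db", q + ".tsv")
#             for q in ["seed_1.fasta", "seed_2.fasta"]]
# exit_codes = run_bounded(commands, nb_blast=2)
```

Lowering the cap trades speed for memory, since each BLAST process holds its own working data.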
--db : List of the chosen cross-reference numbers used to retrieve data from the bioDBnet database. Default: 137 45 46 47 (UniProt ID, GO-term databases)
WARNING: Requesting too many cross-refs triggers a slow mode in which requests are sent one by one.
If a request for a single protein still fails in one-by-one mode,
the program will ABORT with a "too many cross-refs chosen" exception!
--nb-prot : Number of proteins per request to the bioDBnet database.
WARNING: Requesting too many cross-refs triggers a slow mode in which requests are sent one by one.
If a single request fails in this mode, the program will ABORT with a
"too many cross-refs chosen" exception!
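The batching and slow-mode behaviour described in the warnings above can be sketched as follows. This is an illustration under stated assumptions, not PyFuncover's code: `TooManyCrossRefsError`, `chunked` and `fetch_annotations` are hypothetical names, and `request` stands for any caller-supplied function that queries bioDBnet for a batch of proteins and raises an exception when the request fails.

```python
class TooManyCrossRefsError(Exception):
    """Raised when even a single-protein request fails (hypothetical name)."""


def chunked(items, size):
    """Yield consecutive slices of items holding at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_annotations(proteins, cross_refs, nb_prot, request):
    """Query proteins in batches of nb_prot, falling back to one-by-one mode.

    request(batch, cross_refs) is a caller-supplied function that returns
    a dict {protein: annotation} or raises an exception on failure.
    """
    results = {}
    for batch in chunked(proteins, nb_prot):
        try:
            results.update(request(batch, cross_refs))
        except Exception:
            # Slow mode: retry the failed batch one protein at a time.
            for protein in batch:
                try:
                    results.update(request([protein], cross_refs))
                except Exception as err:
                    # A single-protein request failed: give up.
                    raise TooManyCrossRefsError(
                        "request failed for {} with {} cross-refs: {}".format(
                            protein, len(cross_refs), err))
    return results
```

With this structure, a smaller --nb-prot value or a shorter --db list makes each request lighter and reduces the chance of dropping into slow mode.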
- : Affy ID
- : Agilent ID
- : Allergome Code
- : ApiDB_CryptoDB ID
- : Biocarta Pathway Name
- : BioCyc ID
- : CCDS ID
- : Chromosomal Location
- : CleanEx ID
- : CodeLink ID
- : COSMIC ID
- : CPDB Protein Interactor
- : CTD Disease Info
- : CTD Disease Name
- : CYGD ID
- : dbSNP ID
- : dictyBase ID
- : DIP ID
- : DisProt ID
- : DrugBank Drug ID
- : DrugBank Drug Info
- : DrugBank Drug Name
- : EC Number
- : EchoBASE ID
- : EcoGene ID
- : Ensembl Biotype
- : Ensembl Gene ID
- : Ensembl Gene Info
- : Ensembl Protein ID
- : Ensembl Transcript ID
- : FlyBase Gene ID
- : FlyBase Protein ID
- : FlyBase Transcript ID
- : GAD Disease Info
- : GAD Disease Name
- : GenBank Nucleotide Accession
- : GenBank Nucleotide GI
- : GenBank Protein Accession
- : GenBank Protein GI
- : Gene ID
- : Gene Info
- : Gene Symbol
- : Gene Symbol and Synonyms
- : Gene Symbol ORF
- : Gene Synonyms
- : GeneFarm ID
- : GO - Biological Process
- : GO - Cellular Component
- : GO - Molecular Function
- : GO ID
- : GSEA Standard Name
- : H-Inv Locus ID
- : HAMAP ID
- : HGNC ID
- : HMDB Metabolite
- : Homolog - All Ens Gene ID
- : Homolog - All Ens Protein ID
- : Homolog - All Gene ID
- : Homolog - Human Ens Gene ID
- : Homolog - Human Ens Protein ID
- : Homolog - Human Gene ID
- : Homolog - Mouse Ens Gene ID
- : Homolog - Mouse Ens Protein ID
- : Homolog - Mouse Gene ID
- : Homolog - Rat Ens Gene ID
- : Homolog - Rat Ens Protein ID
- : Homolog - Rat Gene ID
- : HomoloGene ID
- : HPA ID
- : HPRD ID
- : HPRD Protein Complex
- : HPRD Protein Interactor
- : Illumina ID
- : IMGT/GENE-DB ID
- : InterPro ID
- : IPI ID
- : KEGG Disease ID
- : KEGG Gene ID
- : KEGG Orthology ID
- : KEGG Pathway ID
- : KEGG Pathway Info
- : KEGG Pathway Title
- : LegioList ID
- : Leproma ID
- : Locus Tag
- : MaizeGDB ID
- : MEROPS ID
- : MGC(ZGC/XGC) ID
- : MGC(ZGC/XGC) Image ID
- : MGC(ZGC/XGC) Info
- : MGI ID
- : MIM ID
- : MIM Info
- : miRBase ID
- : NCIPID Pathway Name
- : NCIPID Protein Complex
- : NCIPID Protein Interactor
- : NCIPID PTM
- : Orphanet ID
- : PANTHER ID
- : Paralog - Ens Gene ID
- : PBR ID
- : PDB ID
- : PeroxiBase ID
- : Pfam ID
- : PharmGKB Drug Info
- : PharmGKB Gene ID
- : PIR ID
- : PIRSF ID
- : PptaseDB ID
- : PRINTS ID
- : ProDom ID
- : PROSITE ID
- : PseudoCAP ID
- : PubMed ID
- : Reactome ID
- : Reactome Pathway Name
- : REBASE ID
- : RefSeq Genomic Accession
- : RefSeq Genomic GI
- : RefSeq mRNA Accession
- : RefSeq ncRNA Accession
- : RefSeq Nucleotide GI
- : RefSeq Protein Accession
- : RefSeq Protein GI
- : Rfam ID
- : RGD ID
- : SGD ID
- : SMART ID
- : STRING Protein Interactor
- : TAIR ID
- : Taxon ID
- : TCDB ID
- : TIGRFAMs ID
- : TubercuList ID
- : UCSC ID
- : UniGene ID
- : UniProt Accession
- : UniProt Entry Name
- : UniProt Info
- : UniProt Protein Name
- : UniSTS ID
- : VectorBase Gene ID
- : VEGA Gene ID
- : VEGA Protein ID
- : VEGA Transcript ID
- : WormBase Gene ID
- : WormPep Protein ID
- : XenBase Gene ID
- : ZFIN ID