SeqFea-Learn - An integrated Python package for DNA, RNA and Protein sequencing data analysis

It includes feature extraction, feature selection, dimensionality reduction, and model construction for sequencing data.

Table of Contents

- Installation
- Data Preparation
- DNA Feature Extraction
- RNA Feature Extraction
- Protein Feature Extraction
- Feature Selection
- Dimensionality Reduction
- Feature Evaluation

Installation

The package is developed with Python 3 (version 3.0 or above) and runs on Linux operating systems. We strongly recommend installing Anaconda (Python 3.7 or above) so that most of the required packages are already available.

After installing Anaconda, the following packages still need to be installed (a quick import check is shown after the list):

  1. xgboost
  2. skrebate
  3. lightgbm
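
A quick way to confirm that these dependencies are available in the active environment is simply to import them. This is only a sanity-check sketch; any missing package can be installed with conda or pip:

```python
# Verify that the extra dependencies are importable in the current environment.
# If any import fails, install that package with conda or pip first.
import xgboost
import skrebate
import lightgbm

print("xgboost", xgboost.__version__)
print("lightgbm", lightgbm.__version__)
print("skrebate imported OK")
```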

The source code is freely available at: https://github.com/ashinandjay/FeatureSelection

To install our tool, first download the zip file manually from GitHub, or use the commands below on a Unix system:

cd your_folder_path
wget https://github.com/ashinandjay/FeatureSelection/archive/master.zip

Unzip the file:

unzip master.zip

Data Preparation

The DNA, RNA or protein sequence data (FASTA format) and their labels (txt format) are required to use our feature selection tool.
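
A common source of errors is a mismatch between the number of sequences and the number of labels. The snippet below is a minimal consistency check; it assumes the label file holds one label per line, which is an assumption about the format rather than something specified here:

```python
# Count FASTA records and label lines and make sure they match.
# Assumes the label file contains one label per (non-empty) line.
def count_fasta_records(path):
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def count_labels(path):
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())

n_seq = count_fasta_records("DNA_sequencing.txt")
n_lab = count_labels("label.txt")
print(f"{n_seq} sequences, {n_lab} labels")
assert n_seq == n_lab, "each sequence needs exactly one label"
```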

DNA Feature Extraction

The tool includes 16 feature extraction methods for DNA sequencing data.

| DNA Extraction Method | DNA Extraction Number |
| --- | --- |
| Kmer | 1 |
| Reverse Complement Kmer | 2 |
| Pseudo dinucleotide composition | 3 |
| Pseudo k-tuple nucleotide composition | 4 |
| Dinucleotide-based auto covariance | 5 |
| Dinucleotide-based cross covariance | 6 |
| Dinucleotide-based auto-cross covariance | 7 |
| Trinucleotide-based auto covariance | 8 |
| Trinucleotide-based cross covariance | 9 |
| Trinucleotide-based auto-cross covariance | 10 |
| Nucleic acid composition | 11 |
| Di-nucleotide composition | 12 |
| Tri-nucleotide composition | 13 |
| zcurve | 14 |
| monoMonoKGap | 15 |
| monoDiKGap | 16 |

DNA_Feature_Extraction.py requires two inputs: the DNA Extraction number and the DNA sequencing data.

Run DNA_Feature_Extraction.py:

DNA_Feature_Extraction.py [DNA Extraction number] [DNA sequencing data]

Example: use the Kmer method to extract features from DNA sequencing data

DNA_Feature_Extraction.py 1 DNA_sequencing.txt
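
To extract several DNA feature sets in one pass, the script can also be driven from Python. This is only a convenience sketch built around the documented command line; the method numbers come from the table above:

```python
import subprocess

# Run a few DNA extraction methods back to back
# (1 = Kmer, 2 = Reverse Complement Kmer, 3 = Pseudo dinucleotide composition).
for method_number in (1, 2, 3):
    subprocess.run(
        ["python", "DNA_Feature_Extraction.py", str(method_number), "DNA_sequencing.txt"],
        check=True,
    )
```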

RNA Feature Extraction

The tool includes 12 feature extraction methods for RNA sequencing data.

| RNA Extraction Method | RNA Extraction Number |
| --- | --- |
| Kmer | 1 |
| Reverse Complement Kmer | 2 |
| Pseudo dinucleotide composition | 3 |
| Dinucleotide-based auto covariance | 4 |
| Dinucleotide-based cross covariance | 5 |
| Dinucleotide-based auto-cross covariance | 6 |
| Nucleic acid composition | 7 |
| Di-nucleotide composition | 8 |
| Tri-nucleotide composition | 9 |
| zcurve | 10 |
| monoMonoKGap | 11 |
| monoDiKGap | 12 |

RNA_Feature_Extraction.py requires two inputs: the RNA Extraction number and the RNA sequencing data.

Run RNA_Feature_Extraction.py:

RNA_Feature_Extraction.py [RNA Extraction number] [RNA sequencing data]

Example: use the Kmer method to extract features from RNA sequencing data

RNA_Feature_Extraction.py 1 RNA_sequencing.txt

Protein Feature Extraction

The tool includes 32 feature extraction methods for Protein sequencing data.

| Protein Extraction Method | Protein Extraction Number |
| --- | --- |
| Amino acid composition | 1 |
| Composition of k-spaced amino acid pairs | 2 |
| Dipeptide composition | 3 |
| Grouped dipeptide composition | 4 |
| Grouped tripeptide composition | 5 |
| Conjoint triad | 6 |
| k-spaced conjoint triad | 7 |
| Composition | 8 |
| Transition | 9 |
| Distribution | 10 |
| Encoding based on grouped weight | 11 |
| Auto covariance | 12 |
| Moran autocorrelation | 13 |
| Geary autocorrelation | 14 |
| Quasi-sequence-order | 15 |
| Pseudo-amino acid composition | 16 |
| Amphiphilic pseudo-amino acid composition | 17 |
| Amino Acid Composition PSSM | 18 |
| Dipeptide composition PSSM | 19 |
| Pseudo PSSM | 20 |
| Auto covariance PSSM | 21 |
| Cross covariance PSSM | 22 |
| Auto Cross covariance PSSM | 23 |
| Bigram-PSSM | 24 |
| AB-PSSM | 25 |
| Secondary structure composition | 26 |
| Accessible surface area composition | 27 |
| Torsional angles composition | 28 |
| Torsional angles bigram | 29 |
| Structural probabilities Bigram | 30 |
| Torsional angles auto-covariance | 31 |
| Structural probabilities auto-covariance | 32 |

Protein_Feature_Extraction.py requires two inputs: the Protein Extraction number and the Protein sequencing data.

Run Protein_Feature_Extraction.py:

Protein_Feature_Extraction.py [Protein Extraction number] [Protein sequencing data]

Example: use the Amino acid composition method to extract features from Protein sequencing data

Protein_Feature_Extraction.py 1 Protein_sequencing.txt
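
The extraction scripts produce the feature vectors that the selection step below consumes. Assuming the output is a CSV file with one row per sequence (the Feature Selection example below uses Feature_Vectors.csv), it can be inspected with pandas before moving on; the actual output filename depends on the script:

```python
import pandas as pd

# Inspect the extracted feature matrix: rows = sequences, columns = features.
# "Feature_Vectors.csv" is the example filename used in the selection step;
# a header row is assumed here.
features = pd.read_csv("Feature_Vectors.csv")
print(features.shape)
print(features.head())
```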

Feature Selection

Our Feature Selection tool contains 20 supervised selection methods.

| Feature Selection Method | Feature Selection Number |
| --- | --- |
| Lasso | 1 |
| Elastic Net | 2 |
| L1-SVM | 3 |
| CHI2 | 4 |
| Pearson Correlation | 5 |
| ExtraTree | 6 |
| XGBoost | 7 |
| SVM-RFE | 8 |
| LOG-RFE | 9 |
| Mutual Information | 10 |
| Minimum Redundancy Maximum Relevance | 11 |
| Joint Mutual Information | 12 |
| Maximum-Relevance-Maximum-Distance | 13 |
| ReliefF | 14 |
| Trace Ratio | 15 |
| Gini index | 16 |
| SPEC | 17 |
| Fisher Score | 18 |
| T Score | 19 |
| Information Gain | 20 |

To use our Feature Selection tool, four inputs are required:

  1. Feature selection number (see the table above)
  2. Number of features to select (how many features you want to keep)
  3. Feature Vectors (feature extraction output file)
  4. Label Vectors (labels for the sequences)

Run Feature_Selection.py:

Feature_Selection.py [Feature selection number] [Number of features to select] [Feature Vectors] [Label Vectors]

Example: use the Lasso method to select 3 features

Feature_Selection.py 1 3 Feature_Vectors.csv label.txt
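
For readers curious about what Lasso-based selection does, the scikit-learn snippet below reproduces the idea: fit a sparse linear model and keep the k features with the largest absolute coefficients. It is a conceptual illustration under simplifying assumptions (numeric labels, CSV feature file), not the package's own implementation:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Feature vectors and labels (filenames follow the example above;
# labels are assumed to be numeric, one per line).
X = pd.read_csv("Feature_Vectors.csv")
y = np.loadtxt("label.txt")

# Keep exactly 3 features ranked by absolute Lasso coefficients
# (alpha is illustrative and would normally be tuned).
selector = SelectFromModel(Lasso(alpha=0.01), max_features=3, threshold=-np.inf)
selector.fit(X, y)
print("Selected features:", list(X.columns[selector.get_support()]))
```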

Dimensionality Reduction

Our Feature Reduction tool contains 16 unsupervised dimensionality reduction methods.

| Dimensionality Reduction Method | Feature Reduction Number |
| --- | --- |
| K-means | 1 |
| T-SNE | 2 |
| Principal Component Analysis | 3 |
| Kernel PCA | 4 |
| Locally-linear embedding | 5 |
| Singular Value Decomposition | 6 |
| Non-negative matrix factorization | 7 |
| Multi-dimensional Scaling | 8 |
| Independent Component Analysis | 9 |
| Factor Analysis | 10 |
| Agglomerate Feature | 11 |
| Gaussian random projection | 12 |
| Sparse random projection | 13 |
| Autoencoder | 14 |
| Gaussian Noise Autoencoder | 15 |
| Variational Autoencoder | 16 |

To use our Feature Reduction tool, three inputs are required:

  1. Feature Reduction number (see the table above)
  2. Number of clusters or components (the output dimensionality you want)
  3. Feature Vectors (feature extraction output file)

Run Feature_Reduction.py:

Feature_Reduction.py [Feature Reduction number] [Number of clusters/components] [Feature Vectors]

Example: use the PCA method to reduce the feature vectors to 3 components

Feature_Reduction.py 3 3 Feature_Vectors.csv
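
For reference, the PCA reduction (method 3) corresponds to standard principal component analysis as implemented in scikit-learn, which ships with Anaconda. The sketch below shows the equivalent stand-alone call; it is illustrative and does not use the package's own script:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Reduce the extracted feature matrix to 3 principal components.
X = pd.read_csv("Feature_Vectors.csv")
reduced = PCA(n_components=3).fit_transform(X)
print(reduced.shape)  # (number of sequences, 3)
```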

Feature Evaluation

The feature selection methods can be evaluated with 10 classifiers and 3 deep learning methods: SVM, KNN, Random Forest, LightGBM, XGBoost, AdaBoost, Bagging, ExtraTree, Gaussian Naïve Bayes and gradient boosting, plus DNN, CNN and RNN predictors. The classification accuracy comparison files (a plot and a table) are generated in the same folder as the code.

Run Feature_Evaluation.py:

Feature_Evaluation.py [Feature selection output] [Label Vectors]

Example: evaluate the features selected with the Lasso method

Feature_Evaluation.py Lasso.csv label.txt
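
Conceptually, the evaluation step trains each classifier on the selected features and compares cross-validated accuracies. The snippet below sketches that idea for two of the listed classifiers using scikit-learn; it is a simplified illustration, not the Feature_Evaluation.py script itself:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Selected features (output of the selection step) and numeric labels.
X = pd.read_csv("Lasso.csv")
y = np.loadtxt("label.txt")

# Compare 5-fold cross-validated accuracy for two of the listed classifiers.
for name, clf in [("SVM", SVC()), ("RandomForest", RandomForestClassifier())]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```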
