This project is not yet complete.
Anu is a framework to test and benchmark machine learning (ML) models for predicting protein-protein interactions. It automates data retrieval, feature engineering, and model evaluation.
To develop or use anu you will need:
- git
- Python 3.7 or above
- Python virtual environment (venv)
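You can quickly check that the prerequisites are available:
git --version
python --version # should report 3.7 or above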
git clone https://github.com/ankitskvmdam/anu.git
python -m venv venv # create a Python virtual environment
. ./venv/bin/activate # activate the virtual environment
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
or
pipx install poetry
For more information about Poetry, see the Poetry docs.
pip install nox
Run the tests, lint checks, type checks, doc tests, and coverage:
nox
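Nox can also run a single session. The session name below is an assumption about this project's noxfile; check the output of nox --list for the real names.
# List the available sessions
nox --list
# Run only one session, e.g. the test session (session name is an assumption)
nox -s tests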
For more information, see the Nox tutorial.
To use this tool, the first few steps are the same as in the development setup.
git clone https://github.com/ankitskvmdam/anu.git
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
or
pipx install poetry
Now run the following commands:
# First move to the directory
cd anu
# Installing anu
poetry install
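To check that the installation worked, you can ask the CLI for help. The top-level anu --help is an assumption based on the --help shown for the subcommands below; if the anu command is not on your PATH, run it through Poetry.
# Run anu inside the Poetry-managed environment
poetry run anu --help
# Or activate the environment first
poetry shell
anu --help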
anu fetches data from two protein databases:
- Pickle - Interacting protein database
- Negatome - Non-interacting protein database
Currently there is no way to tell anu to download only one of the databases. This feature will be implemented in a future release.
# Download both databases
anu data fetch databases
# For help/more information
anu data fetch databases --help
From these databases, anu prepares two dataframes:
- Pickle dataset dataframe (a vaex dataframe)
- Negatome dataset dataframe (a vaex dataframe)
Currently there is no way to tell anu to build the dataframes individually. This feature will be implemented in a future release.
# Prepare the pickle and negatome dataframes
anu data prepare dataframes
# For help/more information
anu data prepare dataframes --help
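The prepared dataframes can be opened with vaex for a quick sanity check. The file path below is only a placeholder; point it at wherever anu stores the dataframes on your machine.
# Inspect a prepared dataframe (path is a placeholder)
python -c "import vaex; df = vaex.open('path/to/pickle_dataframe.hdf5'); print(df.head(5))"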
Now we have to fetch the PDB files.
Since there are almost 30,000 proteins in the pickle database and around 10,000 in the negatome database, it is hard to fetch them all at once. The fetching process is therefore resumable, and for testing, 300 to 400 files per dataset are enough. Once you have downloaded enough files, press Ctrl+C to exit.
# For help/more information
anu data fetch pdb --help
# Fetch PDB files for proteins in the pickle dataset
anu data fetch pdb -p
# or
anu data fetch pdb --pickle
# Fetch PDB files for proteins in the negatome dataset
anu data fetch pdb -n
# or
anu data fetch pdb --negatome
# Fetch PDB files for both datasets
anu data fetch pdb
If a PDB file has already been downloaded, it will not be downloaded again; downloaded PDB files are kept in sync between both datasets.
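To decide when you have fetched enough files to stop, you can count the downloaded PDB files. The directory below is a placeholder for wherever anu stores them on your machine.
# Count downloaded PDB files (directory is a placeholder)
ls path/to/downloaded/pdb/files | wc -l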
Preparing the input dataframes is also a time-consuming process.
# For help/more information
anu data prepare inputs --help
# Prepare interacting protein dataframe
anu data prepare inputs -i
# or
anu data prepare inputs --interacting
# Prepare non-interacting protein dataframe
anu data prepare inputs -n
# or
anu data prepare inputs --non-interacting
# Prepare both input dataframes
anu data prepare inputs
Next, train a model. Currently only the CNN model is available.
anu train cnn
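Assuming the same --help convention as the other commands (not verified here), the training options can be listed with:
# For help/more information
anu train cnn --help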
Before making predictions, you have to train the model.
# For help/more information
anu predict protein --help
# Give PDB IDs as input
anu predict protein -p "1gzx" "4hh3"
# Give UniProt IDs as input
anu predict protein -u "F4JRB0" "Q8RX29"
# Give paths to PDB files as input
anu predict protein "path/to/protein/a.pdb" "path/to/protein/b.pdb"