uBERTa

A BERT-based model that annotates putative translation initiation sites (TISs) of upstream open reading frames (uORFs) in the human genome.

Installation

1. (Optional) Create and activate a new conda environment, e.g.:
   1. conda create -n uberta python=3.8 -c conda-forge -y
   2. conda activate uberta
2. Clone this repo and enter it:
   1. git clone https://github.com/skoblov-lab/uBERTa.git
   2. cd uBERTa
3. Install dependencies:
   1. Install pytorch, e.g.: conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
   2. Install uBERTa and its dependencies from the repository root: pip install .

To go through the notebooks, additional dependencies are required. We assume you are using conda.

- Jupyter (e.g., jupyter lab): conda install jupyterlab -c conda-forge -y
- scikit-learn and scipy: conda install -c conda-forge -y scikit-learn scipy
- pyliftover: pip install pyliftover
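
After installation, a quick check that the packages are visible to the interpreter can save time before opening the notebooks. The following is a minimal sketch, not part of the repository. The import names are assumptions: `uBERTa` is guessed from the repository folder name, and `sklearn` is the import name of scikit-learn.

```python
import importlib.util
import sys

# Assumed import names; "uBERTa" is guessed from the repository folder name.
CORE = ["torch", "uBERTa"]
NOTEBOOK = ["jupyterlab", "sklearn", "scipy", "pyliftover"]


def missing_modules(names):
    """Return the names from `names` that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for label, names in (("core", CORE), ("notebooks", NOTEBOOK)):
        missing = missing_modules(names)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

Run it inside the activated environment. Any name reported as missing points to an installation step that has to be repeated.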

Usage

Use this link to download the trained model.

Currently, uBERTa requires an experimental signal as part of its input, so its use is limited to regions of the human genome that are well covered by experimental data. Keep this in mind when applying the model.

A basic usage example is provided in notebooks/basic_usage.ipynb. For more advanced usage, see predict_5UTR.ipynb.

Note on XGBoost

As explained in the paper, XGBoost outperformed the DistilBERT model. notebooks/xgb.ipynb contains all the code needed to train and validate the model and to predict 5'UTR TISs. The trained XGBoost model is available via this link.

Predictions

To download predictions for 5'UTRs, please use the following links:

The archives contain:

- predictions_5UTR.tsv: a table with predictions and prediction probabilities for 5'UTRs. Genomic positions refer to the first nucleotide of the start codon; on the "-" strand the orientation is reversed.
- predictions_5UTR.bed: a file that can be loaded into a genome browser. Prediction probabilities are given as percentages in the fifth column, while prediction types are denoted by colors that depend on the dataset. For the inference dataset, which encompasses 5'UTRs that did not undergo manual curation, green and blue denote positive and negative predictions, respectively. For 5'UTRs that did undergo manual curation, the colors are as follows:
  - green -- True Positive (TP)
  - blue -- True Negative (TN)
  - red -- False Negative (FN)
  - black -- False Positive (FP)
- prediction_scores.tsv: a table with prediction scores per dataset and start codon.
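
The conventions above can be decoded programmatically. The following is a minimal sketch under several assumptions that should be checked against the actual files: the BED file is tab-separated with nine columns and the color in the standard itemRgb column (ninth), the RGB triplets are pure colors such as `0,255,0`, and a "-" strand position marks the highest genomic coordinate of the start codon. The function and dictionary names are hypothetical and are not part of uBERTa.

```python
# Assumed RGB triplets; verify them against the itemRgb values in the real file.
CURATED = {"0,255,0": "TP", "0,0,255": "TN", "255,0,0": "FN", "0,0,0": "FP"}
INFERENCE = {"0,255,0": "positive", "0,0,255": "negative"}


def start_codon_span(position, strand):
    """Inclusive genomic span of a start codon whose first nucleotide is at `position`.

    Assumes that on the "-" strand the first nucleotide has the highest coordinate.
    """
    if strand == "+":
        return position, position + 2
    if strand == "-":
        return position - 2, position
    raise ValueError(f"unknown strand: {strand!r}")


def parse_bed_line(line, curated=True):
    """Decode one 9-column BED record into a dict."""
    fields = line.rstrip("\n").split("\t")
    colors = CURATED if curated else INFERENCE
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "strand": fields[5],
        # The fifth column holds the probability in percent.
        "probability": float(fields[4]) / 100.0,
        # The ninth column (itemRgb) encodes the prediction type.
        "label": colors.get(fields[8], "unknown"),
    }
```

Passing `curated=False` applies the two-color scheme of the inference dataset. Colors outside the chosen scheme are reported as `unknown` rather than guessed.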

Additionally, we provide predictions for lncRNA:

uBERTa predictions
XGBoost predictions: TBD

Additional links

Check https://github.com/bioinf/uORF_annotator for uORF_annotator -- a tool to annotate functional impacts of the discovered uORFs.
