A Python package for RNA Motif Library creation.
preprint: (insert paper link here)
A default CSV (nrlist_3.262_3.5A.csv) is in the directory data/csvs.
Make sure to download the most recent data (3.5 Å resolution):
http://rna.bgsu.edu/rna3dhub/nrlist
In the directory data/csvs, delete the default CSV file and replace it with your download.
# Clone the repository and navigate to project directory
git clone https://github.com/YesselmanLab/rna_motif_library.git
cd rna_motif_library
# Create and activate new conda environment
conda create --name rna_motif_env python=3.8
conda activate rna_motif_env
# Install the package
pip install .
# Make sure you put the downloaded CSV in the right place or you will get errors
# Put the CSV where the default CSV is and delete the default
# Note: despite the use of "PDB" in language, all files are actually ".cif", not ".pdb"
# ALWAYS CLEAR THE data/out_csvs DIRECTORY BEFORE RUNNING THE SCRIPT (if it exists)! Move the data somewhere else if you want to keep it.
# If this folder is not empty (its nonexistence is OK), there will be problems!
# To create the library first we need to download the PDBs specified in the CSV
python rna_motif_library/cli.py download-cifs --threads 8
# Replace "8" with the number of CPU cores you want to use
# Estimated time: 15 minutes for around 2000 .cifs
# Expect a progress bar when it's working
# After downloading we need to process with DSSR
python rna_motif_library/cli.py process-dssr --threads 8
# Replace "8" with the number of CPU cores you want to use
# Estimated time: 90 minutes for around 2000 .cifs
# There will be visual feedback in the terminal window if it's working properly
# Feedback will consist of the path to the PDB/CIF files
# After processing with DSSR we need to process with SNAP
python rna_motif_library/cli.py process-snap --threads 8
# Replace "8" with the number of CPU cores you want to use
# Estimated time: 9 hours for around 2000 .cifs
# There will be visual feedback in the terminal window if it's working properly
# Feedback will consist of the path + other information on nucleotides/etc
# After processing with SNAP we need to generate motif files
python rna_motif_library/cli.py generate-motifs
# Estimated time: 5 days for around 2000 .cifs
# There will be visual feedback in the terminal window if it's working properly
# Feedback will display the names of the motifs being processed
# After generating motifs we find tertiary contacts
python rna_motif_library/cli.py load-tertiary-contacts
# No threading for this one
# Estimated time: 36 hours for around 2000 .cifs
# There will be visual feedback in the terminal window if it's working properly
# Feedback will display which motifs' hydrogen bonding it's looking at
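The warning above about data/out_csvs can be handled before each run with a small helper. The following is a minimal sketch and not part of the package: the function name set_aside_output and the data/backups location are assumptions made for illustration. It moves a non-empty data/out_csvs directory to a timestamped backup and leaves a missing or empty directory alone.

```python
import shutil
import time
from pathlib import Path


def set_aside_output(out_dir="data/out_csvs", backup_root="data/backups"):
    """Move a non-empty output directory out of the way before a run.

    Returns the backup path, or None if nothing had to be moved.
    The backup location is a hypothetical choice, not a package convention.
    """
    out_path = Path(out_dir)
    # A missing or empty directory is fine; there is nothing to move.
    if not out_path.is_dir() or not any(out_path.iterdir()):
        return None
    # Timestamp the backup so earlier results are not overwritten.
    stamp = time.strftime("%Y%m%d_%H%M%S")
    target = Path(backup_root) / f"{out_path.name}_{stamp}"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(out_path), str(target))
    return target

# Example (run from the project directory before starting the pipeline):
# backup = set_aside_output()
# print("Previous results moved to:", backup)
```

Moving rather than deleting keeps earlier results available, in line with the advice to move the data elsewhere if you want to keep it.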
When finished, you will see several new directories.
data/motifs
- motifs found in the non-redundant set go here, categorized by type, size, and sequence
data/interactions
- individual residues which hydrogen-bond with each other go here, classified by which residues are interacting
data/tertiary_contacts
- tertiary contacts found go here, classified by what two types of motifs are in the contact
data/out_csvs
- CSVs with further data go here
data/out_json
- motif data for each PDB is saved in JSON files for further analysis
Note: folders in data/motifs named nways refer to n-way junctions (2ways, 3ways, etc.)
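To get a quick overview of what a run produced, the motif files can be tallied per category. This is a minimal sketch under stated assumptions: it treats each top-level folder inside data/motifs as a motif type and counts the .cif files beneath it; the function name count_motif_files is hypothetical and the exact folder layout may differ from what is assumed here.

```python
from collections import Counter
from pathlib import Path


def count_motif_files(motif_root="data/motifs", suffix=".cif"):
    """Count files with the given suffix under each top-level folder.

    Assumes (not confirmed by the package) that the first folder level
    below motif_root corresponds to the motif type.
    """
    root = Path(motif_root)
    counts = Counter()
    for path in root.rglob(f"*{suffix}"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Files sitting directly in the root have no type folder; skip them.
        if len(rel.parts) > 1:
            counts[rel.parts[0]] += 1
    return counts

# Example:
# for motif_type, n in sorted(count_motif_files().items()):
#     print(motif_type, n)
```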
The figures used were generated whilst running update_library.py using the default CSV inside the directory data/csvs.
For further details, check out figure_plotting.py.
Figures 2 and 3 are PNGs; they are in the project directory.
Figure 4 also consists of PNGs; however, every interaction/atom combination gets its own figure.
These figures can be found in the directory heatmaps.
Data for each respective figure is broken down in CSV files, which are in heatmap_data.
If you are interested in only a certain number of PDBs, you can run the following:
# Make sure to first delete the directories "data/motifs", "data/interactions", "data/tertiary_contacts", and "data/out_csvs" so data doesn't overlap
python cli.py generate-motifs --limit 8
# Replace "8" with your desired number
# The same files are always processed first; the order is fixed, not random
If you are interested in a specific PDB, you can run the following command in the same directory:
# Make sure to delete the directories "motifs", "interactions", "tertiary_contacts", "heatmaps", and "heatmap_data" if you've run the full code already
# Make sure your file is within the nonredundant set
# Look for "PDB_name.json" and "PDB_name.out" in /dssr_output and /snap_output
python cli.py generate-motifs --pdb 3R9X
# Replace "3R9X" with your desired PDB
Once generate-motifs has completed, it will have created a number of JSON files in data/out_json, each containing motif data.
These files contain properties of the motif as well as the CIF data and coordinates of the atoms.
This data can be used for deeper analyses without re-running the script.
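As a starting point for such analyses, the JSON files can be loaded with the standard library alone. The sketch below makes no assumption about the JSON schema: it only reports the top-level keys (or list length) of each file so you can inspect the structure before writing anything more specific. The function names load_motif_json and summarize are hypothetical and not part of the package.

```python
import json
from pathlib import Path


def load_motif_json(json_dir="data/out_json"):
    """Yield (file name, parsed content) for each JSON file in the directory."""
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open() as handle:
            yield path.name, json.load(handle)


def summarize(json_dir="data/out_json"):
    """Map each file name to its top-level keys (dict) or its length (list)."""
    summary = {}
    for name, content in load_motif_json(json_dir):
        if isinstance(content, dict):
            summary[name] = sorted(content)
        elif isinstance(content, list):
            summary[name] = len(content)
        else:
            # Scalars are unlikely here, but record their type just in case.
            summary[name] = type(content).__name__
    return summary

# Example:
# for name, info in summarize().items():
#     print(name, info)
```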
To regenerate motif .cif files from JSON data, run the following command:
python rna_motif_library/cli.py reload-from-json
This command will take all the JSON files in data/out_json and regenerate the motifs accordingly.
You may get a very unusual error involving DSSR (or other aspects of the program) that I have yet to discover.
In that case, remove the offending .cif, .json, and .out files from data/pdbs, data/dssr_output, and data/snap_output before running the generate-motifs command again.
This will remove the offending PDB from the final data set.
I have added automated error handling for this, but if it doesn't work (ends up in an infinite loop), or other errors come up, contact us at (email), and send the traceback in a .txt file, along with the files you removed.
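The manual cleanup described above can be scripted. This is a minimal sketch under stated assumptions, not part of the package: it assumes the files are named after the PDB ID (for example 3R9X.cif in data/pdbs, 3R9X.json in data/dssr_output, and 3R9X.out in data/snap_output), and the function name quarantine_pdb and the quarantine folder are hypothetical. It moves the files instead of deleting them, so they remain available to attach to a bug report.

```python
import shutil
from pathlib import Path

# Folder -> file extension. This pairing is an assumption for illustration;
# check your own data directory before relying on it.
LOCATIONS = {"pdbs": ".cif", "dssr_output": ".json", "snap_output": ".out"}


def quarantine_pdb(pdb_id, data_dir="data", quarantine_dir="quarantine"):
    """Move the files belonging to one PDB ID into a quarantine folder.

    Returns the list of new file locations (empty if nothing was found).
    """
    moved = []
    dest_root = Path(quarantine_dir) / pdb_id
    for folder, ext in LOCATIONS.items():
        src = Path(data_dir) / folder / f"{pdb_id}{ext}"
        if src.is_file():
            dest = dest_root / folder
            dest.mkdir(parents=True, exist_ok=True)
            target = dest / src.name
            shutil.move(str(src), str(target))
            moved.append(target)
    return moved

# Example:
# for path in quarantine_pdb("3R9X"):
#     print("moved to", path)
```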
When the run is finished, you may see a number of new CSVs with data inside the package directory.
Here we will describe the most important CSVs.
TODO update this section post-refactor