RNA graph construction and KNN representation #109

Closed
rg314 opened this issue Feb 19, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@rg314
Contributor

rg314 commented Feb 19, 2022

I’ve just started looking at RNA graph construction. Ideally, I’d like to generate a KNN representation of the RNA. This is currently implemented for proteins in the graphein.protein.edges.distance.add_k_nn_edges function. In short, the edges for the KNN method are added as follows:

  1. Compute the distance matrix
    a. To compute the distance matrix we need to know the x, y, z position of each base pair (BP) of the RNA
  2. Compute the k nearest neighbours of each node (using sklearn.neighbors.kneighbors_graph)
  3. Join the interacting nodes calculated in step 2
  4. Return the graph
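
The steps above can be sketched in a few lines. This is a minimal illustration, not the graphein implementation: the function name knn_edges and the toy coordinates are hypothetical, and plain NumPy stands in for sklearn.neighbors.kneighbors_graph so that the distance matrix in step 1 is visible.

```python
from typing import Set, Tuple

import numpy as np


def knn_edges(coords: np.ndarray, k: int) -> Set[Tuple[int, int]]:
    """Return undirected k-NN edges (i, j), i < j, for an (n, 3) coordinate array."""
    n = coords.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of nodes")
    # Step 1: pairwise Euclidean distance matrix via broadcasting.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A node is never its own neighbour: push the diagonal to infinity.
    np.fill_diagonal(dist, np.inf)
    # Step 2: indices of the k closest nodes for every node.
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Step 3: join each node to its neighbours, storing each pair once.
    edges = set()
    for i in range(n):
        for j in neighbours[i]:
            edges.add((min(i, int(j)), max(i, int(j))))
    # Step 4: return the edge set (a graph library could consume it directly).
    return edges


# Toy data: four hypothetical node positions along the x axis.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [7.0, 0.0, 0.0]])
edges = knn_edges(coords, k=1)
```

Note that k-NN is not symmetric (node 3 picks node 2, but node 2 picks node 1), so the sketch keeps an edge whenever either endpoint selects the other.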

At the moment the x, y, z coordinates for protein structures are obtained from a PDB file. This is not yet built for RNA structures. For an RNA sequence we must use the sequence and/or dot-bracket notation to get the 3D structural information.

If the dot-bracket notation is not provided, it can be calculated using the Nussinov algorithm (a DP approach; see https://github.com/cgoliver/Nussinov/blob/master/nussinov.py for a Python implementation). See the implementation at https://github.com/rg314/graphein/blob/rna-model/graphein/rna/nussinov.py

Note that the Nussinov algorithm does not guarantee that the dot-bracket notation is correct. There are several other ways of computing this.
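
For readers unfamiliar with the approach, here is a textbook-style sketch of Nussinov base-pair maximisation with a traceback to dot-bracket notation. It is not the code from either linked repository; the min_loop parameter (minimum number of unpaired bases inside a hairpin) and the inclusion of G-U wobble pairs are assumptions made for the illustration.

```python
def nussinov(seq: str, min_loop: int = 3) -> str:
    """Return a dot-bracket string that maximises the number of base pairs."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # dp[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j stays unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback: recover one optimal structure from the filled table.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in pairs:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


hairpin = nussinov("GGGAAACCC")
```

The result is only one of possibly many structures with the maximum pair count, which is part of why the output need not match the true fold.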

The PDB database contains some RNA structures (~5233). PandasPdb can be used to read in the PDB file directly. I suggest that the current protein config is adapted so that the RNA structure can be read in from a PDB file. @a-r-j what do you think? I have started to implement this; please see https://github.com/rg314/graphein/blob/35bd2297d28bf09bcf0fb98c10c3866d4be6cb83/graphein/rna/graphs.py#L209 (note that reading in the df is currently failing).

Then we can look at alternative sources for reading in the structure.

For example, it appears that the Xiao lab (http://biophy.hust.edu.cn/new/) has a RESTful API that returns RNA structures. However, I have not investigated this in detail or checked whether it returns the correct 3D data. This could somewhat mimic the behaviour of graphein.protein.utils.download_alphafold_structure.

Does anyone have an idea of other databases that could be used?

I’m also open to creating a server that can be contacted with a RESTful API to predict RNA structure. However, we would need to figure out the best implementation for structure prediction (and make sure it doesn’t take too long 😉).

@rg314 rg314 changed the title RNA graph construction KNN representation RNA graph construction and KNN representation Feb 19, 2022
@a-r-j
Owner

a-r-j commented Feb 19, 2022

Hey, thanks for this Ryan! Looks exciting!

So, I think we should keep RNA secondary structure & 3D structure separate for now. The secondary structure works as a standalone piece of functionality (though it would be really nice to hook it up to Nussinov or bpRNA - the largest database I know of).

With respect to 3D graphs - I had a quick look at this. I think it's actually quite straightforward as most of the components are implemented for protein structure graphs. Essentially, we can use the low-level API in graphein as building blocks and make a function more or less identical to the construct_graphs we use for proteins. The main things I saw so far that need changing:

We need some granularity options for RNA graphs

Then, we simply add a new function convert_structure_to_rna in this block, e.g.:

RNA_ATOMS = [
    "C1'",
    "C2",
    "C2'",
    "C3'",
    "C4",
    "C4'",
    "C5",
    "C5'",
    "C6",
    "C8",
    "N1",
    "N2",
    "N3",
    "N4",
    "N6",
    "N7",
    "N9",
    "O2",
    "O2'",
    "O3'",
    "O4",
    "O4'",
    "O5'",
    "O6",
    "OP1",
    "OP2",
    "P",
]


def subset_structure_to_rna(
    df: pd.DataFrame,
) -> pd.DataFrame:
    """
    Return a subset of atomic dataframe that contains only certain atom names relevant for RNA structures.

    :param df: Structure dataframe to subset
    :type df: pd.DataFrame
    :returns: Subsetted structure dataframe
    :rtype: pd.DataFrame
    """
    return filter_dataframe(
        df, by_column="atom_name", list_of_values=RNA_ATOMS, boolean=True
    )

but more flexible (not keeping the RNA_ATOMS fixed so users can subset as they wish)
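
A more flexible variant might look like the sketch below. It is an assumption-laden illustration rather than the function that ended up in graphein: the name subset_rna_atoms, the atom_names parameter and the shortened default list are hypothetical, and plain pandas isin replaces filter_dataframe so the snippet runs on its own.

```python
from typing import Iterable, Optional

import pandas as pd

# Shortened, illustrative default; the full RNA_ATOMS list above could be used instead.
DEFAULT_RNA_ATOMS = ("P", "C4'", "N1", "N9")


def subset_rna_atoms(
    df: pd.DataFrame, atom_names: Optional[Iterable[str]] = None
) -> pd.DataFrame:
    """Keep only rows whose atom_name is in atom_names (or in the default list)."""
    keep = list(DEFAULT_RNA_ATOMS if atom_names is None else atom_names)
    return df.loc[df["atom_name"].isin(keep)].reset_index(drop=True)


# Toy atomic dataframe with the atom_name column used in the snippet above.
atoms = pd.DataFrame(
    {
        "atom_name": ["P", "C4'", "O6", "P", "N9"],
        "residue_number": [1, 1, 1, 2, 2],
    }
)
default_subset = subset_rna_atoms(atoms)               # drops only O6
phosphates = subset_rna_atoms(atoms, atom_names=["P"])  # user-chosen subset
```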

The only other line that breaks is this one, and we can easily fix it by removing the three_to_1 call if we're constructing an RNA graph. Then we're essentially good to go. The graph has been populated with the nodes and we write whatever edge functions we like to go on top as per the protein API.

What I'm unfamiliar with is how we coarsen the RNA graphs. What I've described above is the all-atom case. For proteins it's obviously very normal to consider the alpha carbon trace as representative of a residue-level graph. I'm not sure what the standard for RNA is. In any case, we can leave this open to users with the granularity param. What do you think?
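
One way a granularity param could behave is sketched below. Everything here is hypothetical: the function name coarsen_rna, the accepted values ("all_atom", "centroid", or an atom name such as "P") and the choice of representative atom are illustrative only, since the thread does not settle on a standard. The column names assume the BioPandas-style atomic dataframe.

```python
import pandas as pd

COORDS = ["x_coord", "y_coord", "z_coord"]
RESIDUE_KEYS = ["chain_id", "residue_number"]


def coarsen_rna(df: pd.DataFrame, granularity: str = "all_atom") -> pd.DataFrame:
    """Return one row per atom, per nucleotide centroid, or per chosen atom."""
    if granularity == "all_atom":
        return df.copy()
    if granularity == "centroid":
        # Average the coordinates of all atoms belonging to each nucleotide.
        return df.groupby(RESIDUE_KEYS, as_index=False)[COORDS].mean()
    # Otherwise treat granularity as the name of a representative atom.
    return df.loc[df["atom_name"] == granularity].reset_index(drop=True)


# Toy structure: two nucleotides with two atoms each.
atoms = pd.DataFrame(
    {
        "chain_id": ["A", "A", "A", "A"],
        "residue_number": [1, 1, 2, 2],
        "atom_name": ["P", "C4'", "P", "C4'"],
        "x_coord": [0.0, 2.0, 4.0, 6.0],
        "y_coord": [0.0, 0.0, 0.0, 0.0],
        "z_coord": [0.0, 0.0, 0.0, 0.0],
    }
)
centroids = coarsen_rna(atoms, "centroid")
phosphate_trace = coarsen_rna(atoms, "P")
```

The centroid rows no longer carry an atom_name, so downstream node labelling would need to key on chain and residue number instead.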

@a-r-j a-r-j added the enhancement New feature or request label Feb 19, 2022
@a-r-j
Owner

a-r-j commented Mar 22, 2022

Came across this today: https://www.biorxiv.org/content/10.1101/2022.03.14.484334v1

Might be of interest to you @rg314

@rg314
Contributor Author

rg314 commented Apr 4, 2022

Just to follow up on this... we found that the nussinov.py algo isn't great at predicting the dot-bracket notation. I suggest that we create a container running https://github.com/rg314/centroid-rna-package and ping it to get the centroid secondary structure. What do you think @a-r-j ?

@a-r-j
Owner

a-r-j commented Jul 12, 2022

Implemented in 1.5.0

@a-r-j a-r-j closed this as completed Jul 12, 2022