Have conversion from stratified biom to melted pandas table / tensor #125

Open
mortonjt opened this issue Jun 23, 2021 · 4 comments

@mortonjt
Contributor

mortonjt commented Jun 23, 2021

Right now, the biom table OGU ids consist of both taxa and KEGG ids. It would be nice if there were convenience functions to allow for conversions to gene tables or tensors -- it is nontrivial to implement this from scratch.

I'm pretty close to a working solution, will post on this thread shortly

@mortonjt
Contributor Author

mortonjt commented Jun 23, 2021

Alright as promised, here is the solution for turning these function tables into microbe x gene counts

from scipy.sparse import coo_matrix
import pandas as pd
import numpy as np

def get_microbe_gene_table(func_table):
    """Obtain a genes per microbe table.

    Parameters
    ----------
    func_table : biom.Table
        Stratified biom table output from woltka, with observation
        ids of the form 'OGU|KEGG'

    Returns
    -------
    pd.DataFrame
        Table of microbes (rows) by gene (columns) counts
    """
    func_ids = func_table.ids(axis='observation')
    func_df = pd.DataFrame([x.split('|') for x in func_ids],
                           columns=['OGU', 'KEGG'])
    func_df['count'] = 1

    # map every OGU and KEGG id to an integer row / column index
    # (sorted so that the output order is reproducible)
    ogus = sorted(set(func_df['OGU']))
    ogu_lookup = pd.Series(np.arange(len(ogus)), index=ogus)
    keggs = sorted(set(func_df['KEGG']))
    kegg_lookup = pd.Series(np.arange(len(keggs)), index=keggs)
    func_df['OGU_id'] = func_df['OGU'].map(ogu_lookup).astype(np.int64)
    func_df['KEGG_id'] = func_df['KEGG'].map(kegg_lookup).astype(np.int64)

    # convert to sparse matrix for convenience
    c, i, j = func_df['count'].values, func_df['OGU_id'].values, func_df['KEGG_id'].values
    data = coo_matrix((c, (i, j)), shape=(len(ogus), len(keggs)))
    # pandas conversion optional. Can convert to biom if needed
    ko_ogu = pd.DataFrame(data.toarray(), index=ogus, columns=keggs)
    return ko_ogu
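
For anyone who wants to sanity-check the idea without a biom table at hand, here is a rough pandas-only sketch of the same microbe x gene counting. It assumes the feature ids are plain 'OGU|KEGG' strings; the helper name `microbe_gene_crosstab` and the toy ids are made up for illustration and are not part of woltka.

```python
import pandas as pd


def microbe_gene_crosstab(feature_ids):
    """Count how often each OGU / KEGG pair occurs in a list of 'OGU|KEGG' ids.

    Hypothetical helper for illustration only.
    """
    # split each id once on '|' into its OGU and KEGG parts
    parts = pd.DataFrame(
        [fid.split('|', 1) for fid in feature_ids],
        columns=['OGU', 'KEGG'],
    )
    # cross-tabulate: rows are OGUs, columns are KEGG ids, cells are counts
    return pd.crosstab(parts['OGU'], parts['KEGG'])


# toy ids, assumed for this example
ids = ['Ecoli|K0123', 'Ecoli|K0456', 'Strep|K0123']
table = microbe_gene_crosstab(ids)
print(table)
```

`pd.crosstab` builds a dense table, so for large stratified tables the sparse route in the function above is probably the safer choice.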

@qiyunzhu
Owner

@mortonjt Thank you for sharing your thoughts and code! It would be great if you could clarify what a microbe x gene counts table is. From reading your code, I have the following impression. Is my understanding correct?

Before:

FeatureID S01 S02 S03
Ecoli|K0123 2 0 5
Ecoli|K0456 13 7 4
Strep|K0123 0 3 8

After:

OGU KEGG S01 S02 S03
Ecoli K0123 2 0 5
Ecoli K0456 13 7 4
Strep K0123 0 3 8

Also pinging @droush because this question may be relevant.

@mortonjt
Contributor Author

mortonjt commented Jun 25, 2021

Not quite: the code that I provided loses the sample information, and it also keeps track of gene copy number per microbe.
So it will look something like:

OGU K0123 K0456 K0123
Ecoli 1 0 0
Strep 0 1 0
Cdiff 0 0 1

Another useful format would be a sparse COO format, where the output would be 4 columns like

OGU KEGG Sample Counts
Ecoli K0123 S01 2
Ecoli K0123 S03 5
Ecoli K0456 S01 13
Ecoli K0456 S02 7
Ecoli K0456 S03 4
Strep K0123 S02 3
Strep K0123 S03 8
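
A rough sketch of how this melted COO layout, and the tensor mentioned in the issue title, might be produced with pandas and numpy. It assumes a wide DataFrame indexed by 'OGU|KEGG' ids with one column per sample, filled with the toy counts from the Before table earlier in the thread. The names `melt_stratified` and `to_tensor` are made up for illustration.

```python
import numpy as np
import pandas as pd


def melt_stratified(df, sep='|'):
    """Melt a wide 'OGU|KEGG' x sample table into OGU, KEGG, Sample, Counts rows."""
    long = (
        df.rename_axis('FeatureID')
          .reset_index()
          .melt(id_vars='FeatureID', var_name='Sample', value_name='Counts')
    )
    # split the stratified id into its two parts
    parts = [fid.split(sep, 1) for fid in long['FeatureID']]
    long['OGU'] = [p[0] for p in parts]
    long['KEGG'] = [p[1] for p in parts]
    # keep only nonzero entries, as in a sparse COO layout
    long = long[long['Counts'] != 0]
    return (long[['OGU', 'KEGG', 'Sample', 'Counts']]
            .sort_values(['OGU', 'KEGG', 'Sample'])
            .reset_index(drop=True))


def to_tensor(long):
    """Build a dense OGU x KEGG x Sample array from the melted table."""
    # factorize gives integer codes plus the sorted unique labels of each axis
    i, ogus = pd.factorize(long['OGU'], sort=True)
    j, keggs = pd.factorize(long['KEGG'], sort=True)
    k, samples = pd.factorize(long['Sample'], sort=True)
    tensor = np.zeros((len(ogus), len(keggs), len(samples)),
                      dtype=long['Counts'].dtype)
    tensor[i, j, k] = long['Counts'].to_numpy()
    return tensor, list(ogus), list(keggs), list(samples)


# toy counts, assumed for this example
wide = pd.DataFrame(
    {'S01': [2, 13, 0], 'S02': [0, 7, 3], 'S03': [5, 4, 8]},
    index=['Ecoli|K0123', 'Ecoli|K0456', 'Strep|K0123'],
)
long = melt_stratified(wide)
tensor, ogus, keggs, samples = to_tensor(long)
print(long)
print(tensor.shape)
```

The dense tensor is only practical for small tables. For real data, the melted table itself (or a sparse tensor library) is likely the better representation.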

@qiyunzhu
Owner

@mortonjt Very good idea! I edited your response a bit to make the tables render correctly.
