Have conversion from stratified biom to melted pandas table / tensor #125

mortonjt · 2021-06-23T18:25:57Z

Right now, the observation IDs in the stratified biom table combine both the taxon (OGU) and the KEGG ID. It would be nice if there were convenience functions to convert these tables to gene tables or tensors, since it is nontrivial to implement this from scratch.

I'm pretty close to a working solution and will post it on this thread shortly.

mortonjt · 2021-06-23T18:44:54Z

Alright, as promised, here is a solution for turning these function tables into microbe x gene counts:

from scipy.sparse import coo_matrix
import pandas as pd
import numpy as np

def get_microbe_gene_table(func_table):
    """Obtain a genes per microbe table.

    Parameters
    ----------
    func_table : biom.Table
        Stratified biom table output from woltka, with observation
        IDs of the form 'OGU|KEGG'

    Returns
    -------
    pd.DataFrame
        Table of microbes by gene counts
    """
    func_ids = func_table.ids(axis='observation')
    # split each 'OGU|KEGG' observation ID into its two parts
    func_df = pd.DataFrame([x.split('|') for x in func_ids],
                           columns=['OGU', 'KEGG'])
    func_df['count'] = 1

    # map every OGU and KEGG ID to an integer index (sorted for a stable order)
    ogus = sorted(set(func_df['OGU']))
    ogu_lookup = pd.Series(np.arange(len(ogus)), index=ogus)
    keggs = sorted(set(func_df['KEGG']))
    kegg_lookup = pd.Series(np.arange(len(keggs)), index=keggs)
    func_df['OGU_id'] = func_df['OGU'].map(ogu_lookup).astype(np.int64)
    func_df['KEGG_id'] = func_df['KEGG'].map(kegg_lookup).astype(np.int64)

    # convert to sparse matrix for convenience
    c, i, j = func_df['count'].values, func_df['OGU_id'].values, func_df['KEGG_id'].values
    data = coo_matrix((c, (i, j)), shape=(len(ogus), len(keggs)))
    # pandas conversion optional. Can convert to biom if needed
    ko_ogu = pd.DataFrame(data.toarray(), index=ogus, columns=keggs)
    return ko_ogu
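
For a quick sanity check without scipy or a biom file, roughly the same microbe x gene table can be sketched with pandas alone. This is only a minimal sketch, assuming the observation IDs are plain 'OGU|KEGG' strings; the function name `microbe_gene_table_from_ids` and the example IDs are made up for illustration and are not part of woltka or biom.

```python
import pandas as pd

def microbe_gene_table_from_ids(func_ids):
    """Cross-tabulate 'OGU|KEGG' IDs into an OGU x KEGG count table.

    Hypothetical helper for illustration only.
    """
    # split on the first '|' only, so every ID yields exactly two parts
    pairs = pd.DataFrame([x.split('|', 1) for x in func_ids],
                         columns=['OGU', 'KEGG'])
    # crosstab counts how often each (OGU, KEGG) pair occurs
    return pd.crosstab(pairs['OGU'], pairs['KEGG'])

# illustrative IDs, not real data
ids = ['Ecoli|K0123', 'Ecoli|K0456', 'Strep|K0123']
table = microbe_gene_table_from_ids(ids)
print(table)
```

Since observation IDs in a biom table are unique, every entry here ends up as 0 or 1. The result stays dense, so for large tables the sparse route above is probably the better choice.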

qiyunzhu · 2021-06-24T22:17:56Z

@mortonjt Thank you for sharing your thoughts and code! It would be great if you could clarify what a microbe x gene counts table is. From reading your code, I have the following impression. Is my understanding correct?

Before:

FeatureID	S01	S02	S03
Ecoli|K0123	2	0	5
Ecoli|K0456	13	7	4
Strep|K0123	0	3	8

After:

OGU	KEGG	S01	S02	S03
Ecoli	K0123	2	0	5
Ecoli	K0456	13	7	4
Strep	K0123	0	3	8
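
The Before/After transformation sketched here could look roughly like this in pandas. This is a sketch under assumptions: the stratified table is already a DataFrame with 'OGU|KEGG' strings as its index and one column per sample (a biom Table can be turned into such a frame, for example with its `to_dataframe` method), and the helper name `split_stratified_ids` is hypothetical.

```python
import pandas as pd

def split_stratified_ids(df, sep='|'):
    """Replace an 'OGU|KEGG' index with a two-level (OGU, KEGG) index.

    Hypothetical helper for illustration only; sample columns are untouched.
    """
    # split on the first separator only, giving (OGU, KEGG) tuples
    parts = [tuple(str(fid).split(sep, 1)) for fid in df.index]
    out = df.copy()
    out.index = pd.MultiIndex.from_tuples(parts, names=['OGU', 'KEGG'])
    return out

# the "Before" table from the example
before = pd.DataFrame(
    {'S01': [2, 13, 0], 'S02': [0, 7, 3], 'S03': [5, 4, 8]},
    index=['Ecoli|K0123', 'Ecoli|K0456', 'Strep|K0123'])

# reset_index turns the two index levels into ordinary OGU and KEGG columns
after = split_stratified_ids(before).reset_index()
print(after)
```

Keeping OGU and KEGG as index levels (that is, skipping `reset_index`) makes it easy to group by either one, for example `split_stratified_ids(before).groupby(level='KEGG').sum()` for a gene-only table.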

Also pinging @droush because this question may be relevant.

mortonjt · 2021-06-25T03:35:13Z

Not quite. The code that I provided loses the sample information. It also keeps track of gene copy number per microbe, so the output will look something like:

OGU	K0123	K0456	K0123
Ecoli	1	0	0
Strep	0	1	0
Cdiff	0	0	1

Another useful format would be a sparse COO format, where the output would be 4 columns like

OGU	KEGG	Sample	Counts
Ecoli	K0123	S01	2
Ecoli	K0123	S03	5
Ecoli	K0456	S01	13
Ecoli	K0456	S02	7
Ecoli	K0456	S03	4
Strep	K0123	S02	3
Strep	K0123	S03	8
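
A rough sketch of how this four-column format, and a dense tensor built from it, might be produced with pandas and numpy. It assumes the stratified table is a DataFrame with 'OGU|KEGG' strings as its index and one column per sample; the names `melt_stratified` and `to_tensor` are hypothetical, and the numbers are the ones from the example table earlier in the thread.

```python
import numpy as np
import pandas as pd

def melt_stratified(df, sep='|'):
    """Melt a feature x sample frame into OGU, KEGG, Sample, Counts rows."""
    wide = df.copy()
    wide.index.name = 'FeatureID'
    melted = wide.reset_index().melt(
        id_vars='FeatureID', var_name='Sample', value_name='Counts')
    # keep only nonzero entries, as in a sparse COO representation
    melted = melted[melted['Counts'] != 0].copy()
    parts = [fid.split(sep, 1) for fid in melted['FeatureID']]
    melted['OGU'] = [p[0] for p in parts]
    melted['KEGG'] = [p[1] for p in parts]
    cols = ['OGU', 'KEGG', 'Sample', 'Counts']
    return melted[cols].sort_values(cols[:3]).reset_index(drop=True)

def to_tensor(melted):
    """Build a dense OGU x KEGG x Sample array from the melted table."""
    ogus, keggs, samples = (sorted(melted[c].unique())
                            for c in ['OGU', 'KEGG', 'Sample'])
    tensor = np.zeros((len(ogus), len(keggs), len(samples)),
                      dtype=melted['Counts'].dtype)
    # translate labels into integer positions along each axis
    i = pd.Index(ogus).get_indexer(melted['OGU'])
    j = pd.Index(keggs).get_indexer(melted['KEGG'])
    k = pd.Index(samples).get_indexer(melted['Sample'])
    tensor[i, j, k] = melted['Counts'].to_numpy()
    return tensor, ogus, keggs, samples

# illustrative input, taken from the example table in this thread
table = pd.DataFrame(
    {'S01': [2, 13, 0], 'S02': [0, 7, 3], 'S03': [5, 4, 8]},
    index=['Ecoli|K0123', 'Ecoli|K0456', 'Strep|K0123'])
melted = melt_stratified(table)
tensor, ogus, keggs, samples = to_tensor(melted)
print(melted)
print(tensor.shape)
```

One caveat with this sketch: the tensor axes are derived from the melted table, so a sample or gene whose counts are all zero would not get its own slice. The dense tensor can also get large quickly, so a sparse representation may be preferable for real data.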

qiyunzhu · 2021-06-25T15:04:33Z

@mortonjt Very good idea! I edited your response a bit to make the tables render correctly.
