Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge kmcp profilings of multiple samples #45

Open
DrYoungOG opened this issue Jun 10, 2024 · 3 comments
Open

Merge kmcp profilings of multiple samples #45

DrYoungOG opened this issue Jun 10, 2024 · 3 comments

Comments

@DrYoungOG
Copy link

Thanks for the good software.

I have completed the searching and profiling commands and got profiling results of multiple samples. How to merge these profiling results into one feature table for downstream ayalysis?

Thank you!

@shenwei356
Copy link
Owner

You can merge them with existing tools, especially those for Metaphlan.

Profiling result formats

Taxonomic profiling output formats:

  • KMCP (-o/--out-file). Note that: abundances are only computed for target references rather than each taxon at all taxonomic ranks, so please output CAMI or MetaPhlAn format.
  • CAMI (-M/--metaphlan-report, --metaphlan-report-version, sample name: -s/--sample-id, taxonomy data: --taxonomy-id)
  • MetaPhlAn (-C/--cami-report, sample name: -s/--sample-id))

@DrYoungOG
Copy link
Author

I see. Thanks for your prompt reply!

@ruqse
Copy link

ruqse commented Feb 25, 2025

@DrYoungOG I'm guessing you've already figured out a solution, but for others encountering this issue: while the util script in Metaphlan performs the merge, it's designed for Metaphlan4, whereas kmcp metaphlan output formats are limited to Metaphlan2 and 3. Below is a variation of the script that accommodates outputs for both Metaphlan3 and 4.

#!/usr/bin/env python3
import argparse
import os
import sys
import pandas as pd
from itertools import takewhile
import re

def merge(aaastrIn, ostm, gtdb):
    """
    Outputs the table join of the given pre-split string collection.
    :param  aaastrIn:   One or more split lines from which data are read.
    :type   aaastrIn:   collection of collections of string collections
    :param  ostm:       Output stream to which matched rows are written.
    :type   ostm:       output stream
    """
    listmpaVersion = set()
    profiles_list = []
    merged_tables = None
    
    for f in aaastrIn:
        headers = [x.strip() for x in takewhile(lambda x: x.startswith('#'), open(f))]
        listmpaVersion.add(headers[0])
        
        # Check if this is a v3 or v4 format file
        if len(headers) > 1:
            # Extract sample name from header
            sample_name = os.path.splitext(os.path.basename(f))[0].replace('_profile', '')
            
            # Try to extract column names from the last header
            try:
                names = headers[-1].split('#')[1].strip().split('\t')
            except IndexError:
                # If splitting by # fails, try a different approach for v3
                names = [x.strip() for x in headers[-1][1:].split('\t')]
            
            # Determine which columns to use based on header format
            if len(names) >= 3 and 'relative_abundance' in names:
                # Version 4 format
                usecols = [0, 2] if not gtdb else range(2)
                abundance_col = 'relative_abundance'
            else:
                # Version 3 format - typically clade_name, NCBI_tax_id, relative_abundance
                usecols = [0, 2] if not gtdb else range(2)
                abundance_col = names[2] if len(names) > 2 else 'relative_abundance'
                
            # Read the data, skipping the header lines
            try:
                iIn = pd.read_csv(f, sep='\t', skiprows=len(headers), names=names, 
                                 usecols=usecols, index_col=0)
                
                # Create series with the taxonomy data and relative abundance
                if abundance_col in iIn.columns:
                    series = pd.Series(data=iIn[abundance_col], index=iIn.index, name=sample_name)
                else:
                    # If column name extraction failed, use position-based selection
                    df = pd.read_csv(f, sep='\t', skiprows=len(headers), header=None)
                    series = pd.Series(data=df.iloc[:, 2], index=df.iloc[:, 0], name=sample_name)
                
                profiles_list.append(series)
            except Exception as e:
                print(f"Error processing file {f}: {e}")
                continue
        else:
            print(f"Warning: File {f} has insufficient headers. Skipping.")
            continue
    
    if profiles_list:
        merged_tables = pd.concat(profiles_list, axis=1).fillna(0)
        ostm.write(list(listmpaVersion)[0]+'\n')
        merged_tables.to_csv(ostm, sep='\t')
    else:
        print("No valid data found to merge.")

argp = argparse.ArgumentParser(prog="merge_metaphlan_tables.py",
                               description="Performs a table join on one or more metaphlan output files.")
argp.add_argument("aistms", metavar="input.txt", nargs="*", help="One or more tab-delimited text tables to join")
argp.add_argument("-l", help="Name of file containing the paths to the files to combine")
argp.add_argument('-o', metavar="output.txt", help="Name of output file in which joined tables are saved")
argp.add_argument('--overwrite', default=False, action='store_true', help="Overwrite output file if exists")
argp.add_argument('--gtdb_profiles', action='store_true', default=False, help=("To specify when running the script with GTDB-based profiles"))
argp.usage = (argp.format_usage() + "\nPlease make sure to supply file paths to the files to combine.\n\n" +
              "If combining 3 files (Table1.txt, Table2.txt, and Table3.txt) the call should be:\n" +
              "   ./merge_metaphlan_tables.py Table1.txt Table2.txt Table3.txt > output.txt\n\n" +
              "A wildcard to indicate all .txt files that start with Table can be used as follows:\n" +
              "    ./merge_metaphlan_tables.py Table*.txt > output.txt")

def main():
    args = argp.parse_args()
    if args.l:
        args.aistms = [x.strip().split()[0] for x in open(args.l)]
    if not args.aistms:
        print('merge_metaphlan_tables: no inputs to merge!')
        return
    if args.o and os.path.exists(args.o) and not args.overwrite:
        print('merge_metaphlan_tables: output file "{}" exists, specify the --overwrite param to ovrewrite it!'.format(args.o))
        return
    merge(args.aistms, open(args.o, 'w') if args.o else sys.stdout, args.gtdb_profiles)

if __name__ == '__main__':
    main()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants