Merge kmcp profilings of multiple samples #45

DrYoungOG · 2024-06-10T02:35:18Z

Thanks for the good software.

I have completed the searching and profiling commands and got profiling results of multiple samples. How to merge these profiling results into one feature table for downstream ayalysis?

Thank you!

shenwei356 · 2024-06-10T07:12:11Z

You can merge them with existing tools, especially those for Metaphlan.

Profiling result formats

Taxonomic profiling output formats:

KMCP (-o/--out-file). Note that: abundances are only computed for target references rather than each taxon at all taxonomic ranks, so please output CAMI or MetaPhlAn format.
CAMI (-M/--metaphlan-report, --metaphlan-report-version, sample name: -s/--sample-id, taxonomy data: --taxonomy-id)
MetaPhlAn (-C/--cami-report, sample name: -s/--sample-id))

DrYoungOG · 2024-06-13T12:42:45Z

I see. Thanks for your prompt reply!

ruqse · 2025-02-25T15:17:37Z

@DrYoungOG I'm guessing you've already figured out a solution, but for others encountering this issue: while the util script in Metaphlan performs the merge, it's designed for Metaphlan4, whereas kmcp metaphlan output formats are limited to Metaphlan2 and 3. Below is a variation of the script that accommodates outputs for both Metaphlan3 and 4.

#!/usr/bin/env python3
import argparse
import os
import sys
import pandas as pd
from itertools import takewhile
import re

def merge(aaastrIn, ostm, gtdb):
    """
    Outputs the table join of the given pre-split string collection.
    :param  aaastrIn:   One or more split lines from which data are read.
    :type   aaastrIn:   collection of collections of string collections
    :param  ostm:       Output stream to which matched rows are written.
    :type   ostm:       output stream
    """
    listmpaVersion = set()
    profiles_list = []
    merged_tables = None
    
    for f in aaastrIn:
        headers = [x.strip() for x in takewhile(lambda x: x.startswith('#'), open(f))]
        listmpaVersion.add(headers[0])
        
        # Check if this is a v3 or v4 format file
        if len(headers) > 1:
            # Extract sample name from header
            sample_name = os.path.splitext(os.path.basename(f))[0].replace('_profile', '')
            
            # Try to extract column names from the last header
            try:
                names = headers[-1].split('#')[1].strip().split('\t')
            except IndexError:
                # If splitting by # fails, try a different approach for v3
                names = [x.strip() for x in headers[-1][1:].split('\t')]
            
            # Determine which columns to use based on header format
            if len(names) >= 3 and 'relative_abundance' in names:
                # Version 4 format
                usecols = [0, 2] if not gtdb else range(2)
                abundance_col = 'relative_abundance'
            else:
                # Version 3 format - typically clade_name, NCBI_tax_id, relative_abundance
                usecols = [0, 2] if not gtdb else range(2)
                abundance_col = names[2] if len(names) > 2 else 'relative_abundance'
                
            # Read the data, skipping the header lines
            try:
                iIn = pd.read_csv(f, sep='\t', skiprows=len(headers), names=names, 
                                 usecols=usecols, index_col=0)
                
                # Create series with the taxonomy data and relative abundance
                if abundance_col in iIn.columns:
                    series = pd.Series(data=iIn[abundance_col], index=iIn.index, name=sample_name)
                else:
                    # If column name extraction failed, use position-based selection
                    df = pd.read_csv(f, sep='\t', skiprows=len(headers), header=None)
                    series = pd.Series(data=df.iloc[:, 2], index=df.iloc[:, 0], name=sample_name)
                
                profiles_list.append(series)
            except Exception as e:
                print(f"Error processing file {f}: {e}")
                continue
        else:
            print(f"Warning: File {f} has insufficient headers. Skipping.")
            continue
    
    if profiles_list:
        merged_tables = pd.concat(profiles_list, axis=1).fillna(0)
        ostm.write(list(listmpaVersion)[0]+'\n')
        merged_tables.to_csv(ostm, sep='\t')
    else:
        print("No valid data found to merge.")

argp = argparse.ArgumentParser(prog="merge_metaphlan_tables.py",
                               description="Performs a table join on one or more metaphlan output files.")
argp.add_argument("aistms", metavar="input.txt", nargs="*", help="One or more tab-delimited text tables to join")
argp.add_argument("-l", help="Name of file containing the paths to the files to combine")
argp.add_argument('-o', metavar="output.txt", help="Name of output file in which joined tables are saved")
argp.add_argument('--overwrite', default=False, action='store_true', help="Overwrite output file if exists")
argp.add_argument('--gtdb_profiles', action='store_true', default=False, help=("To specify when running the script with GTDB-based profiles"))
argp.usage = (argp.format_usage() + "\nPlease make sure to supply file paths to the files to combine.\n\n" +
              "If combining 3 files (Table1.txt, Table2.txt, and Table3.txt) the call should be:\n" +
              "   ./merge_metaphlan_tables.py Table1.txt Table2.txt Table3.txt > output.txt\n\n" +
              "A wildcard to indicate all .txt files that start with Table can be used as follows:\n" +
              "    ./merge_metaphlan_tables.py Table*.txt > output.txt")

def main():
    args = argp.parse_args()
    if args.l:
        args.aistms = [x.strip().split()[0] for x in open(args.l)]
    if not args.aistms:
        print('merge_metaphlan_tables: no inputs to merge!')
        return
    if args.o and os.path.exists(args.o) and not args.overwrite:
        print('merge_metaphlan_tables: output file "{}" exists, specify the --overwrite param to ovrewrite it!'.format(args.o))
        return
    merge(args.aistms, open(args.o, 'w') if args.o else sys.stdout, args.gtdb_profiles)

if __name__ == '__main__':
    main()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge kmcp profilings of multiple samples #45

Merge kmcp profilings of multiple samples #45

DrYoungOG commented Jun 10, 2024

shenwei356 commented Jun 10, 2024

DrYoungOG commented Jun 13, 2024

ruqse commented Feb 25, 2025

Merge kmcp profilings of multiple samples #45

Merge kmcp profilings of multiple samples #45

Comments

DrYoungOG commented Jun 10, 2024

shenwei356 commented Jun 10, 2024

Profiling result formats

DrYoungOG commented Jun 13, 2024

ruqse commented Feb 25, 2025