-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge kmcp profilings of multiple samples #45
Comments
You can merge them with existing tools, especially those for Metaphlan. Profiling result formatsTaxonomic profiling output formats:
|
I see. Thanks for your prompt reply! |
@DrYoungOG I'm guessing you've already figured out a solution, but for others encountering this issue: while the util script in Metaphlan performs the merge, it's designed for Metaphlan4, whereas kmcp metaphlan output formats are limited to Metaphlan2 and 3. Below is a variation of the script that accommodates outputs for both Metaphlan3 and 4. #!/usr/bin/env python3
import argparse
import os
import sys
import pandas as pd
from itertools import takewhile
import re
def merge(aaastrIn, ostm, gtdb):
"""
Outputs the table join of the given pre-split string collection.
:param aaastrIn: One or more split lines from which data are read.
:type aaastrIn: collection of collections of string collections
:param ostm: Output stream to which matched rows are written.
:type ostm: output stream
"""
listmpaVersion = set()
profiles_list = []
merged_tables = None
for f in aaastrIn:
headers = [x.strip() for x in takewhile(lambda x: x.startswith('#'), open(f))]
listmpaVersion.add(headers[0])
# Check if this is a v3 or v4 format file
if len(headers) > 1:
# Extract sample name from header
sample_name = os.path.splitext(os.path.basename(f))[0].replace('_profile', '')
# Try to extract column names from the last header
try:
names = headers[-1].split('#')[1].strip().split('\t')
except IndexError:
# If splitting by # fails, try a different approach for v3
names = [x.strip() for x in headers[-1][1:].split('\t')]
# Determine which columns to use based on header format
if len(names) >= 3 and 'relative_abundance' in names:
# Version 4 format
usecols = [0, 2] if not gtdb else range(2)
abundance_col = 'relative_abundance'
else:
# Version 3 format - typically clade_name, NCBI_tax_id, relative_abundance
usecols = [0, 2] if not gtdb else range(2)
abundance_col = names[2] if len(names) > 2 else 'relative_abundance'
# Read the data, skipping the header lines
try:
iIn = pd.read_csv(f, sep='\t', skiprows=len(headers), names=names,
usecols=usecols, index_col=0)
# Create series with the taxonomy data and relative abundance
if abundance_col in iIn.columns:
series = pd.Series(data=iIn[abundance_col], index=iIn.index, name=sample_name)
else:
# If column name extraction failed, use position-based selection
df = pd.read_csv(f, sep='\t', skiprows=len(headers), header=None)
series = pd.Series(data=df.iloc[:, 2], index=df.iloc[:, 0], name=sample_name)
profiles_list.append(series)
except Exception as e:
print(f"Error processing file {f}: {e}")
continue
else:
print(f"Warning: File {f} has insufficient headers. Skipping.")
continue
if profiles_list:
merged_tables = pd.concat(profiles_list, axis=1).fillna(0)
ostm.write(list(listmpaVersion)[0]+'\n')
merged_tables.to_csv(ostm, sep='\t')
else:
print("No valid data found to merge.")
argp = argparse.ArgumentParser(prog="merge_metaphlan_tables.py",
description="Performs a table join on one or more metaphlan output files.")
argp.add_argument("aistms", metavar="input.txt", nargs="*", help="One or more tab-delimited text tables to join")
argp.add_argument("-l", help="Name of file containing the paths to the files to combine")
argp.add_argument('-o', metavar="output.txt", help="Name of output file in which joined tables are saved")
argp.add_argument('--overwrite', default=False, action='store_true', help="Overwrite output file if exists")
argp.add_argument('--gtdb_profiles', action='store_true', default=False, help=("To specify when running the script with GTDB-based profiles"))
argp.usage = (argp.format_usage() + "\nPlease make sure to supply file paths to the files to combine.\n\n" +
"If combining 3 files (Table1.txt, Table2.txt, and Table3.txt) the call should be:\n" +
" ./merge_metaphlan_tables.py Table1.txt Table2.txt Table3.txt > output.txt\n\n" +
"A wildcard to indicate all .txt files that start with Table can be used as follows:\n" +
" ./merge_metaphlan_tables.py Table*.txt > output.txt")
def main():
args = argp.parse_args()
if args.l:
args.aistms = [x.strip().split()[0] for x in open(args.l)]
if not args.aistms:
print('merge_metaphlan_tables: no inputs to merge!')
return
if args.o and os.path.exists(args.o) and not args.overwrite:
print('merge_metaphlan_tables: output file "{}" exists, specify the --overwrite param to ovrewrite it!'.format(args.o))
return
merge(args.aistms, open(args.o, 'w') if args.o else sys.stdout, args.gtdb_profiles)
if __name__ == '__main__':
main() |
Thanks for the good software.
I have completed the searching and profiling commands and got profiling results of multiple samples. How to merge these profiling results into one feature table for downstream ayalysis?
Thank you!
The text was updated successfully, but these errors were encountered: