Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functionality to batch gene alignments #306

Closed
revinici opened this issue Aug 8, 2024 · 4 comments
Closed

Add functionality to batch gene alignments #306

revinici opened this issue Aug 8, 2024 · 4 comments

Comments

@revinici
Copy link

revinici commented Aug 8, 2024

The gene alignment step can take a long time to complete. Is it possible to add functionality to output the unaligned gene sequences to later align batches of them across a computing cluster? The pirate pangenome tool, e.g., has a helper script that will align gene families listed in one of its output files, making it easy to subset the file and distribute the gene alignment step to thousands of computers.

@revinici
Copy link
Author

revinici commented Aug 14, 2024

I was able to hack a solution together by breaking apart the logical steps in the post_run_alignment_gen.py script into separate python scripts. The unaligned genes can be output to a folder and batched for alignment via a computing cluster. The gene alignments can then be aggregated into one folder to concatenate the core gene alignments.

For my purposes I broke apart the functionality for pan genome alignment linked below, ignoring the functionality for codon alignments. Thankfully, my solution does not require any edits to the panaroo code base!

#Write out core/pan-genome alignments
if args.aln == "pan":
if args.verbose: print("generating pan genome MSAs...")
generate_pan_genome_alignment(G, temp_dir, args.output_dir, args.n_cpu,
args.alr, args.codons, isolate_names)
core_nodes = get_core_gene_nodes(G, args.core, len(isolate_names))
core_names = [G.nodes[x]["name"] for x in core_nodes]
concatenate_core_genome_alignments(core_names, args.output_dir,
args.hc_threshold)

#Multithread writing gene sequences to disk (temp directory) so aligners can find them
unaligned_sequence_files = Parallel(n_jobs=threads)(
delayed(output_sequence)(G.nodes[x], isolates, temp_dir, output_dir)
for x in tqdm(G.nodes()))
#remove single sequence files
unaligned_sequence_files = filter(None, unaligned_sequence_files)
#Get Biopython command calls for each output gene sequences
commands = [
get_alignment_commands(fastafile, output_dir, aligner, threads)
for fastafile in unaligned_sequence_files
]
#Run these commands in a multi-threaded way
multi_align_sequences(commands, output_dir + "aligned_gene_sequences/",
threads, aligner)

Here is how I got access to the panaroo modules and methods for extracting the gene fastas to a folder. Note, you have to be careful about the assumptions the panaroo code makes like expecting '/' at the end of paths or deleting gene fasta files after alignment.

#!/usr/bin/env python
import networkx as nx
from joblib import Parallel, delayed
from tqdm import tqdm
import shutil

# get the path to the panaroo package files
panaroo_path = subprocess.run('pip show panaroo | grep Location | cut -f2 -d" "',
                              shell=True, check=True, capture_output=True, text=True).stdout
panaroo_path = os.path.join(panaroo_path, 'panaroo')

# add path to python path
sys.path.append(panaroo_path)

# import required panaroo modules
from panaroo.generate_alignments import output_sequence

# load the final pangenome graph
G = nx.read_gml("/path/to/final/graph_gml")

# Load isolate names
isolate_names = G.graph['isolateNames']

# make dummy dir
os.makedirs('panaroo_foo')

# create expected folder
os.makedirs(os.path.join('panaroo_foo', 'aligned_gene_sequences'), exist_ok=True)

# copy combined_DNA_CDS to dummy dir
subprocess.run(f'cp /path/to/combined_DNA_CDS.fasta panaroo_foo/', shell=True, check=True)

#Multithread writing gene sequences to disk (temp directory) so aligners can find them
#Note, if you have lots of genes to extract you can also batch the nodes in the graph for multiple computers 
os.makedirs('extracted_genes')
num_cpu = 24
unaligned_sequence_files = Parallel(n_jobs=num_cpu)(
    delayed(output_sequence)(G.nodes[x], isolate_names, 'extracted_genes/', 'panaroo_foo/')
    for x in tqdm(list(G.nodes()))
)

#remove single sequence files
unaligned_sequence_files = list(filter(None, unaligned_sequence_files))

# clean up
shutil.rmtree('panaroo_foo')

# create batches of extracted gene fastas for downstream alignment

@gtonkinhill
Copy link
Owner

Hi,

Thanks for posting this solution! I'll look at whether we can add this into Panaroo directly when I next get a chance.

gtonkinhill added a commit that referenced this issue Sep 24, 2024
@revinici
Copy link
Author

Okay, I got a much simpler solution. Creating the pangenome sequence alignments and sequence files can take a long time to create, but you can 'scale out' the task to multiple computers by batching the pangenome graph nodes and creating the pangenome alignments for each batch in parallel. My earlier solution just produced the unaligned gene sequences but the new simpler approach produces the all pangenome alignment output files in one go.

Assuming you have the pangenome graph nodes batched, then you can you can generate pangenome alignments for each batch with the code below:

#!/usr/bin/env python3.9
import os
from subprocess import run
import sys

import networkx as nx

# get the path to the panaroo package files
panaroo_path = run('pip show panaroo | grep Location | cut -f2 -d" "',
                              shell=True, check=True, capture_output=True, text=True).stdout
panaroo_path = os.path.join(panaroo_path, 'panaroo')

# add path to python path
sys.path.append(panaroo_path)

# import required panaroo modules
from panaroo.generate_output import generate_pan_genome_alignment

# set threads
threads = 10

# load the final pangenome graph
G = nx.read_gml("/path/to/final/graph_gml")

# load node batch
node_batch = "/path/to/node_batch_1.txt"
with open(node_batch, 'r') as IN:
    selected_nodes = IN.read().splitlines()

# generate a pangenome subgraph conditioned on the nodes in the batch
selected_nodes = [n for n in selected_nodes if n in G]
subgraph = G.subgraph(selected_nodes).copy()

# Load isolate names
isolate_names = G.graph['isolateNames']

temp_dir = 'temp_panalign'
os.makedirs(temp_dir, exist_ok=True)

pangenome_algn_output = f'pan_algn_{os.path.basename(node_batch)}'
os.makedirs(pangenome_algn_output, exist_ok=True)

# decompress the combined DNA and protein fastas
for archive, dest_file in [("combined_DNA_CDS.fasta.gz", "combined_DNA_CDS.fasta"),
                           ("combined_protein_CDS.fasta.gz", "combined_protein_CDS.fasta")]:

    run(f'pigz -d -c {archive} > {pangenome_algn_output}/{dest_file}', shell=True, check=True)


# generate gene and protein MSAs, and gene family multi-FASTAs for the gene nodes in the subgraph
do_codon_alignments = True
generate_pan_genome_alignment(G=subgraph,
                              temp_dir=os.path.join(temp_dir, ""),
                              output_dir=os.path.join(pangenome_algn_output, ""),
                              threads=threads,
                              aligner='mafft',
                              codons=do_codon_alignments,
                              isolates=isolate_names)

# clean up files
run(f'rm {pangenome_algn_output}/combined_protein_CDS.fasta', shell=True, check=True)
run(f'rm {pangenome_algn_output}/combined_DNA_CDS.fasta', shell=True, check=True)
run(f'rm -r {temp_dir}', shell=True, check=True)

# compress pangenome sequences
run(f'tar -cf - {pangenome_algn_output} | pigz -p {threads} > {pangenome_algn_output}.tar.gz', shell=True, check=True)
run(f'rm -r {pangenome_algn_output}', shell=True, check=True)

Then once all batches are done processing you can merge the results like this:

# uncompress the pangenome alignments
mkdir pangenome_alignments

for result_archive in pan_algn_*.tar.gz; do
    tar --use-compress-program 'pigz' -xf $result_archive --strip-components 1 -C pangenome_alignments
done

# make gene alignment archive
mv pangenome_alignments/aligned_gene_sequences gene_alignments
tar -cf - gene_alignments | pigz -p 10 > gene_alignments.tar.gz

# make protein alignment archive
mv pangenome_alignments/aligned_protein_sequences protein_alignments
tar -cf - protein_alignments | pigz -p 10 > protein_alignments.tar.gz

# make gene family sequences archive
mv pangenome_alignments/unaligned_dna_sequences gene_sequences
tar -cf - gene_sequences | pigz -p 10 > gene_sequences.tar.gz

# clean up
rm -r pangenome_alignments gene_alignments protein_alignments gene_sequences

@gtonkinhill
Copy link
Owner

Thanks very much for this. I will update the docs to point to this as a 'how to'.
In terms of the main Panaroo code, we've now added the option to specify 'none' as the aligner which will just generate the unaligned gene files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants