Add functionality to batch gene alignments #306

revinici · 2024-08-08T22:41:38Z

The gene alignment step can take a long time to complete. Is it possible to add functionality to output the unaligned gene sequences to later align batches of them across a computing cluster? The pirate pangenome tool, e.g., has a helper script that will align gene families listed in one of its output files, making it easy to subset the file and distribute the gene alignment step to thousands of computers.

revinici · 2024-08-14T15:59:27Z

I was able to hack a solution together by breaking apart the logical steps in the post_run_alignment_gen.py script into separate python scripts. The unaligned genes can be output to a folder and batched for alignment via a computing cluster. The gene alignments can then be aggregated into one folder to concatenate the core gene alignments.

For my purposes I broke apart the functionality for pan genome alignment linked below, ignoring the functionality for codon alignments. Thankfully, my solution does not require any edits to the panaroo code base!

panaroo/panaroo/post_run_alignment_gen.py

Lines 107 to 116 in e928a7a

    
           #Write out core/pan-genome alignments 
        
           if args.aln == "pan": 
        
               if args.verbose: print("generating pan genome MSAs...") 
        
               generate_pan_genome_alignment(G, temp_dir, args.output_dir, args.n_cpu, 
        
                                             args.alr, args.codons, isolate_names) 
        
               core_nodes = get_core_gene_nodes(G, args.core, len(isolate_names)) 
        
               core_names = [G.nodes[x]["name"] for x in core_nodes] 
        
               concatenate_core_genome_alignments(core_names, args.output_dir,  
        
                                                  args.hc_threshold)

panaroo/panaroo/generate_output.py

Lines 318 to 333 in e928a7a

    
           #Multithread writing gene sequences to disk (temp directory) so aligners can find them 
        
           unaligned_sequence_files = Parallel(n_jobs=threads)( 
        
               delayed(output_sequence)(G.nodes[x], isolates, temp_dir, output_dir) 
        
               for x in tqdm(G.nodes())) 
        
           #remove single sequence files 
        
           unaligned_sequence_files = filter(None, unaligned_sequence_files) 
        
           #Get Biopython command calls for each output gene sequences 
        
           commands = [ 
        
               get_alignment_commands(fastafile, output_dir, aligner, threads) 
        
               for fastafile in unaligned_sequence_files 
        
           ] 
        
           #Run these commands in a multi-threaded way 
        
           multi_align_sequences(commands, output_dir + "aligned_gene_sequences/", 
        
                                 threads, aligner)

Here is how I got access to the panaroo modules and methods for extracting the gene fastas to a folder. Note, you have to be careful about the assumptions the panaroo code makes like expecting '/' at the end of paths or deleting gene fasta files after alignment.

#!/usr/bin/env python
import networkx as nx
from joblib import Parallel, delayed
from tqdm import tqdm
import shutil

# get the path to the panaroo package files
panaroo_path = subprocess.run('pip show panaroo | grep Location | cut -f2 -d" "',
                              shell=True, check=True, capture_output=True, text=True).stdout
panaroo_path = os.path.join(panaroo_path, 'panaroo')

# add path to python path
sys.path.append(panaroo_path)

# import required panaroo modules
from panaroo.generate_alignments import output_sequence

# load the final pangenome graph
G = nx.read_gml("/path/to/final/graph_gml")

# Load isolate names
isolate_names = G.graph['isolateNames']

# make dummy dir
os.makedirs('panaroo_foo')

# create expected folder
os.makedirs(os.path.join('panaroo_foo', 'aligned_gene_sequences'), exist_ok=True)

# copy combined_DNA_CDS to dummy dir
subprocess.run(f'cp /path/to/combined_DNA_CDS.fasta panaroo_foo/', shell=True, check=True)

#Multithread writing gene sequences to disk (temp directory) so aligners can find them
#Note, if you have lots of genes to extract you can also batch the nodes in the graph for multiple computers 
os.makedirs('extracted_genes')
num_cpu = 24
unaligned_sequence_files = Parallel(n_jobs=num_cpu)(
    delayed(output_sequence)(G.nodes[x], isolate_names, 'extracted_genes/', 'panaroo_foo/')
    for x in tqdm(list(G.nodes()))
)

#remove single sequence files
unaligned_sequence_files = list(filter(None, unaligned_sequence_files))

# clean up
shutil.rmtree('panaroo_foo')

# create batches of extracted gene fastas for downstream alignment

gtonkinhill · 2024-08-16T05:45:14Z

Hi,

Thanks for posting this solution! I'll look at whether we can add this into Panaroo directly when I next get a chance.

… to be run separately as requested in #306

revinici · 2024-09-25T19:12:37Z

Okay, I got a much simpler solution. Creating the pangenome sequence alignments and sequence files can take a long time to create, but you can 'scale out' the task to multiple computers by batching the pangenome graph nodes and creating the pangenome alignments for each batch in parallel. My earlier solution just produced the unaligned gene sequences but the new simpler approach produces the all pangenome alignment output files in one go.

Assuming you have the pangenome graph nodes batched, then you can you can generate pangenome alignments for each batch with the code below:

#!/usr/bin/env python3.9
import os
from subprocess import run
import sys

import networkx as nx

# get the path to the panaroo package files
panaroo_path = run('pip show panaroo | grep Location | cut -f2 -d" "',
                              shell=True, check=True, capture_output=True, text=True).stdout
panaroo_path = os.path.join(panaroo_path, 'panaroo')

# add path to python path
sys.path.append(panaroo_path)

# import required panaroo modules
from panaroo.generate_output import generate_pan_genome_alignment

# set threads
threads = 10

# load the final pangenome graph
G = nx.read_gml("/path/to/final/graph_gml")

# load node batch
node_batch = "/path/to/node_batch_1.txt"
with open(node_batch, 'r') as IN:
    selected_nodes = IN.read().splitlines()

# generate a pangenome subgraph conditioned on the nodes in the batch
selected_nodes = [n for n in selected_nodes if n in G]
subgraph = G.subgraph(selected_nodes).copy()

# Load isolate names
isolate_names = G.graph['isolateNames']

temp_dir = 'temp_panalign'
os.makedirs(temp_dir, exist_ok=True)

pangenome_algn_output = f'pan_algn_{os.path.basename(node_batch)}'
os.makedirs(pangenome_algn_output, exist_ok=True)

# decompress the combined DNA and protein fastas
for archive, dest_file in [("combined_DNA_CDS.fasta.gz", "combined_DNA_CDS.fasta"),
                           ("combined_protein_CDS.fasta.gz", "combined_protein_CDS.fasta")]:

    run(f'pigz -d -c {archive} > {pangenome_algn_output}/{dest_file}', shell=True, check=True)


# generate gene and protein MSAs, and gene family multi-FASTAs for the gene nodes in the subgraph
do_codon_alignments = True
generate_pan_genome_alignment(G=subgraph,
                              temp_dir=os.path.join(temp_dir, ""),
                              output_dir=os.path.join(pangenome_algn_output, ""),
                              threads=threads,
                              aligner='mafft',
                              codons=do_codon_alignments,
                              isolates=isolate_names)

# clean up files
run(f'rm {pangenome_algn_output}/combined_protein_CDS.fasta', shell=True, check=True)
run(f'rm {pangenome_algn_output}/combined_DNA_CDS.fasta', shell=True, check=True)
run(f'rm -r {temp_dir}', shell=True, check=True)

# compress pangenome sequences
run(f'tar -cf - {pangenome_algn_output} | pigz -p {threads} > {pangenome_algn_output}.tar.gz', shell=True, check=True)
run(f'rm -r {pangenome_algn_output}', shell=True, check=True)

Then once all batches are done processing you can merge the results like this:

# uncompress the pangenome alignments
mkdir pangenome_alignments

for result_archive in pan_algn_*.tar.gz; do
    tar --use-compress-program 'pigz' -xf $result_archive --strip-components 1 -C pangenome_alignments
done

# make gene alignment archive
mv pangenome_alignments/aligned_gene_sequences gene_alignments
tar -cf - gene_alignments | pigz -p 10 > gene_alignments.tar.gz

# make protein alignment archive
mv pangenome_alignments/aligned_protein_sequences protein_alignments
tar -cf - protein_alignments | pigz -p 10 > protein_alignments.tar.gz

# make gene family sequences archive
mv pangenome_alignments/unaligned_dna_sequences gene_sequences
tar -cf - gene_sequences | pigz -p 10 > gene_sequences.tar.gz

# clean up
rm -r pangenome_alignments gene_alignments protein_alignments gene_sequences

gtonkinhill · 2024-10-07T08:02:19Z

Thanks very much for this. I will update the docs to point to this as a 'how to'.
In terms of the main Panaroo code, we've now added the option to specify 'none' as the aligner which will just generate the unaligned gene files.

gtonkinhill added a commit that referenced this issue Sep 24, 2024

added option specify 'none' as the aligner. This allows for alignment…

4dc499e

… to be run separately as requested in #306

gtonkinhill closed this as completed Oct 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add functionality to batch gene alignments #306

Add functionality to batch gene alignments #306

revinici commented Aug 8, 2024

revinici commented Aug 14, 2024 •

edited

Loading

gtonkinhill commented Aug 16, 2024

revinici commented Sep 25, 2024

gtonkinhill commented Oct 7, 2024

Add functionality to batch gene alignments #306

Add functionality to batch gene alignments #306

Comments

revinici commented Aug 8, 2024

revinici commented Aug 14, 2024 • edited Loading

gtonkinhill commented Aug 16, 2024

revinici commented Sep 25, 2024

gtonkinhill commented Oct 7, 2024

revinici commented Aug 14, 2024 •

edited

Loading