-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add functionality to batch gene alignments #306
Comments
I was able to hack a solution together by breaking apart the logical steps in the For my purposes I broke apart the functionality for pan genome alignment linked below, ignoring the functionality for codon alignments. Thankfully, my solution does not require any edits to the panaroo code base! panaroo/panaroo/post_run_alignment_gen.py Lines 107 to 116 in e928a7a
panaroo/panaroo/generate_output.py Lines 318 to 333 in e928a7a
Here is how I got access to the panaroo modules and methods for extracting the gene fastas to a folder. Note, you have to be careful about the assumptions the panaroo code makes like expecting '/' at the end of paths or deleting gene fasta files after alignment. #!/usr/bin/env python
import networkx as nx
from joblib import Parallel, delayed
from tqdm import tqdm
import shutil
# get the path to the panaroo package files
panaroo_path = subprocess.run('pip show panaroo | grep Location | cut -f2 -d" "',
shell=True, check=True, capture_output=True, text=True).stdout
panaroo_path = os.path.join(panaroo_path, 'panaroo')
# add path to python path
sys.path.append(panaroo_path)
# import required panaroo modules
from panaroo.generate_alignments import output_sequence
# load the final pangenome graph
G = nx.read_gml("/path/to/final/graph_gml")
# Load isolate names
isolate_names = G.graph['isolateNames']
# make dummy dir
os.makedirs('panaroo_foo')
# create expected folder
os.makedirs(os.path.join('panaroo_foo', 'aligned_gene_sequences'), exist_ok=True)
# copy combined_DNA_CDS to dummy dir
subprocess.run(f'cp /path/to/combined_DNA_CDS.fasta panaroo_foo/', shell=True, check=True)
#Multithread writing gene sequences to disk (temp directory) so aligners can find them
#Note, if you have lots of genes to extract you can also batch the nodes in the graph for multiple computers
os.makedirs('extracted_genes')
num_cpu = 24
unaligned_sequence_files = Parallel(n_jobs=num_cpu)(
delayed(output_sequence)(G.nodes[x], isolate_names, 'extracted_genes/', 'panaroo_foo/')
for x in tqdm(list(G.nodes()))
)
#remove single sequence files
unaligned_sequence_files = list(filter(None, unaligned_sequence_files))
# clean up
shutil.rmtree('panaroo_foo')
# create batches of extracted gene fastas for downstream alignment |
Hi, Thanks for posting this solution! I'll look at whether we can add this into Panaroo directly when I next get a chance. |
… to be run separately as requested in #306
Okay, I got a much simpler solution. Creating the pangenome sequence alignments and sequence files can take a long time to create, but you can 'scale out' the task to multiple computers by batching the pangenome graph nodes and creating the pangenome alignments for each batch in parallel. My earlier solution just produced the unaligned gene sequences but the new simpler approach produces the all pangenome alignment output files in one go. Assuming you have the pangenome graph nodes batched, then you can you can generate pangenome alignments for each batch with the code below: #!/usr/bin/env python3.9
import os
from subprocess import run
import sys
import networkx as nx
# get the path to the panaroo package files
panaroo_path = run('pip show panaroo | grep Location | cut -f2 -d" "',
shell=True, check=True, capture_output=True, text=True).stdout
panaroo_path = os.path.join(panaroo_path, 'panaroo')
# add path to python path
sys.path.append(panaroo_path)
# import required panaroo modules
from panaroo.generate_output import generate_pan_genome_alignment
# set threads
threads = 10
# load the final pangenome graph
G = nx.read_gml("/path/to/final/graph_gml")
# load node batch
node_batch = "/path/to/node_batch_1.txt"
with open(node_batch, 'r') as IN:
selected_nodes = IN.read().splitlines()
# generate a pangenome subgraph conditioned on the nodes in the batch
selected_nodes = [n for n in selected_nodes if n in G]
subgraph = G.subgraph(selected_nodes).copy()
# Load isolate names
isolate_names = G.graph['isolateNames']
temp_dir = 'temp_panalign'
os.makedirs(temp_dir, exist_ok=True)
pangenome_algn_output = f'pan_algn_{os.path.basename(node_batch)}'
os.makedirs(pangenome_algn_output, exist_ok=True)
# decompress the combined DNA and protein fastas
for archive, dest_file in [("combined_DNA_CDS.fasta.gz", "combined_DNA_CDS.fasta"),
("combined_protein_CDS.fasta.gz", "combined_protein_CDS.fasta")]:
run(f'pigz -d -c {archive} > {pangenome_algn_output}/{dest_file}', shell=True, check=True)
# generate gene and protein MSAs, and gene family multi-FASTAs for the gene nodes in the subgraph
do_codon_alignments = True
generate_pan_genome_alignment(G=subgraph,
temp_dir=os.path.join(temp_dir, ""),
output_dir=os.path.join(pangenome_algn_output, ""),
threads=threads,
aligner='mafft',
codons=do_codon_alignments,
isolates=isolate_names)
# clean up files
run(f'rm {pangenome_algn_output}/combined_protein_CDS.fasta', shell=True, check=True)
run(f'rm {pangenome_algn_output}/combined_DNA_CDS.fasta', shell=True, check=True)
run(f'rm -r {temp_dir}', shell=True, check=True)
# compress pangenome sequences
run(f'tar -cf - {pangenome_algn_output} | pigz -p {threads} > {pangenome_algn_output}.tar.gz', shell=True, check=True)
run(f'rm -r {pangenome_algn_output}', shell=True, check=True) Then once all batches are done processing you can merge the results like this: # uncompress the pangenome alignments
mkdir pangenome_alignments
for result_archive in pan_algn_*.tar.gz; do
tar --use-compress-program 'pigz' -xf $result_archive --strip-components 1 -C pangenome_alignments
done
# make gene alignment archive
mv pangenome_alignments/aligned_gene_sequences gene_alignments
tar -cf - gene_alignments | pigz -p 10 > gene_alignments.tar.gz
# make protein alignment archive
mv pangenome_alignments/aligned_protein_sequences protein_alignments
tar -cf - protein_alignments | pigz -p 10 > protein_alignments.tar.gz
# make gene family sequences archive
mv pangenome_alignments/unaligned_dna_sequences gene_sequences
tar -cf - gene_sequences | pigz -p 10 > gene_sequences.tar.gz
# clean up
rm -r pangenome_alignments gene_alignments protein_alignments gene_sequences |
Thanks very much for this. I will update the docs to point to this as a 'how to'. |
The gene alignment step can take a long time to complete. Is it possible to add functionality to output the unaligned gene sequences to later align batches of them across a computing cluster? The pirate pangenome tool, e.g., has a helper script that will align gene families listed in one of its output files, making it easy to subset the file and distribute the gene alignment step to thousands of computers.
The text was updated successfully, but these errors were encountered: