
5. Notes on Directories and Workflow

Jeff Bowman edited this page Apr 6, 2024 · 3 revisions

The paprica-build.sh script and associated Python scripts should be left in the paprica directory. If you want to build a new, custom database without overwriting ref_genome_database, it is best to give the new database a unique name and build it within the paprica directory.

When using paprica-run.sh for analysis, it is assumed that you are copying paprica-run.sh to your working directory but leaving the Python scripts in their original location. This allows you to customize the flags for the Python scripts for each new analysis in a reproducible manner. Using paprica-run.sh isn't required, however; you can execute the Python scripts however you like (e.g. in an IPython notebook or directly from the command line). Refer to the Wiki and to paprica-run.sh to see which flags to include as you structure your commands.

In all likelihood you will be running paprica-run.sh on multiple samples, possibly many samples. The bottleneck of the paprica-run.sh script is phylogenetic placement with pplacer during paprica_place_it.py. paprica now includes an option for running pplacer in parallel, but watch your memory usage! Because the whole pipeline is easily parallelized in this way, the most efficient way to process multiple samples is a simple loop. For example, the following loop would execute paprica-run.sh using the bacteria database on all the files in the working directory with a .fasta extension:

for f in *.fasta; do NAME=$(basename "$f" .fasta); ./paprica-run.sh "$NAME" bacteria; done
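
For longer runs, the same loop can be wrapped in a small function that keeps a log per sample and reports failures instead of silently moving on. This is only a sketch: the function name run_all_samples, the RUNNER and DOMAIN variables, and the .run.log file names are illustrative assumptions and are not part of paprica.

```shell
#!/usr/bin/env bash
# Sketch: run paprica-run.sh once per FASTA file in the current directory.
# RUNNER, DOMAIN, and the *.run.log names are hypothetical, not paprica options.
RUNNER="${RUNNER:-./paprica-run.sh}"
DOMAIN="${DOMAIN:-bacteria}"

run_all_samples() {
    local f name
    for f in *.fasta; do
        # If no .fasta files exist, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        # Strip the extension, as in the one-line loop.
        name=$(basename "$f" .fasta)
        echo "Running $name ($DOMAIN)"
        # Capture stdout and stderr per sample; report a failure and continue.
        if ! "$RUNNER" "$name" "$DOMAIN" > "$name.run.log" 2>&1; then
            echo "FAILED: $name (see $name.run.log)" >&2
        fi
    done
}
```

Source the script and call run_all_samples from the working directory that holds paprica-run.sh and the .fasta files. Samples are still processed one at a time, so memory use is governed by a single run.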

Then aggregate the results with:

paprica-combine_results.py -domain bacteria -o [name for your analysis]

Note that if you previously ran paprica-combine_results.py, you may need to delete the output files from that earlier run first.