scripts

  • Most scripts require Python >= 3.9 and pandas

Makes a vOTU table from clustered vOTUs based on read mapping abundances. Can output read counts, counts per million (cpm), or reads per kilobase (rpk).

usage: make_votu_table.py [-h] [-v {count,cpm,rpk}] [-s SUFFIX] mmseqstsv filelist

positional arguments:
  mmseqstsv             mmseqs flattened tsv file (output of flatten_mmseqs_tsv.py)
  filelist              file containing list of sample abundance/count files (one per line) to include
                        in vOTU table. count files are output of verse. sample names should be
                        prefixes

options:
  -h, --help            show this help message and exit
  -v {count,cpm,rpk}, --value {count,cpm,rpk}
                        what value from abundance file to fill in the table (default: count)
  -s SUFFIX, --suffix SUFFIX
                        suffix of abundance files to remove when making table (default:
                        _virus.count.CDS.cpm.txt)
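
As a rough illustration of the options above, the sketch below derives a sample name by stripping the suffix from a count file name, and spells out the standard definitions of counts per million and reads per kilobase. The function names and example values are hypothetical; the script takes these values from the verse output files and may not compute them this way.

```python
from pathlib import Path

# Default of -s/--suffix as shown in the help text above
DEFAULT_SUFFIX = "_virus.count.CDS.cpm.txt"


def sample_name(path, suffix=DEFAULT_SUFFIX):
    """Return the file name with the suffix removed (hypothetical helper)."""
    # str.removesuffix requires Python >= 3.9
    return Path(path).name.removesuffix(suffix)


def counts_per_million(counts):
    """Scale raw counts so they sum to one million (standard CPM definition)."""
    total = sum(counts.values())
    return {k: (v / total * 1e6 if total else 0.0) for k, v in counts.items()}


def reads_per_kilobase(count, length_bp):
    """Reads divided by feature length in kilobases (standard RPK definition)."""
    return count / (length_bp / 1000)


print(sample_name("data/S1_virus.count.CDS.cpm.txt"))
print(counts_per_million({"contig_a": 1, "contig_b": 3}))
print(reads_per_kilobase(10, 2000))
```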

Makes a table of taxonomic (and marker) gene annotations for the representative sequences of vOTUs.

usage: repseq_genomad_annotations.py [-h] votutable filelist

positional arguments:
  votutable   vOTU table
  filelist    file containing list of sample genomad virus_summary.tsv files (one per line) from which
              to get the repseq annotations. sample name should be prefix

options:
  -h, --help  show this help message and exit
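
The lookup described above can be sketched as a filter over a tab-separated summary file. This is only an illustration: the `seq_name` and `taxonomy` column names are assumed from geNomad's virus summary format, the helper name is hypothetical, and the example rows are made up.

```python
import csv
import io


def repseq_annotations(summary_handle, repseqs):
    """Yield (seq_name, taxonomy) for rows whose seq_name is a representative.

    Column names are an assumption about the virus_summary.tsv layout.
    """
    reader = csv.DictReader(summary_handle, delimiter="\t")
    for row in reader:
        if row["seq_name"] in repseqs:
            yield row["seq_name"], row["taxonomy"]


# Made-up rows standing in for one sample's virus_summary.tsv
example = io.StringIO(
    "seq_name\tlength\ttaxonomy\n"
    "S1_contig_1\t5000\tViruses;Duplodnaviria\n"
    "S1_contig_2\t3000\tViruses;Monodnaviria\n"
)
print(list(repseq_annotations(example, {"S1_contig_2"})))
```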

Makes a gff-type table (gtf) from genomad summary output.

usage: genomad_virus2gff.py [-h] [-m HEADERMAP] virussummary

positional arguments:
  virussummary          genomad *_virus_summary.tsv file

options:
  -h, --help            show this help message and exit
  -m HEADERMAP, --headermap HEADERMAP
                        genomad *_assembly_headermap.txt file. optional used in case original contig
                        names were changed.
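
To make the conversion concrete, here is a minimal sketch of turning one summary row into a nine-column GTF line. The column names (`seq_name`, `length`, `virus_score`, `taxonomy`), the choice of feature type and attributes, and the header map being a plain dictionary from renamed to original contig names are all assumptions for illustration, not the script's actual behavior.

```python
def summary_row_to_gtf(row, headermap=None):
    """Format one summary row as a tab-separated GTF line (hypothetical layout)."""
    name = row["seq_name"]
    if headermap:
        # Restore the original contig name when a mapping is available
        name = headermap.get(name, name)
    taxonomy = row["taxonomy"]
    fields = [
        name,                     # seqname
        "geNomad",                # source
        "virus",                  # feature (assumed label)
        "1",                      # start
        str(row["length"]),       # end
        str(row["virus_score"]),  # score
        ".",                      # strand unknown
        ".",                      # frame not applicable
        f'taxonomy "{taxonomy}";',
    ]
    return "\t".join(fields)


row = {"seq_name": "c1", "length": 5000, "virus_score": 0.98, "taxonomy": "Viruses"}
print(summary_row_to_gtf(row, headermap={"c1": "original_contig_1"}))
```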

Makes a gff-type table (gtf) from genomad *virus_genes.tsv and, optionally, *plasmid_genes.tsv files.

positional arguments:
  genesfile             genomad *_virus_genes.tsv file

options:
  -h, --help            show this help message and exit
  -m HEADERMAP, --headermap HEADERMAP
                        genomad *_assembly_headermap.txt file. optional used in case original contig
                        names were changed.
  -p PLASMIDGENESFILE, --plasmidgenesfile PLASMIDGENESFILE
                        genomad *_plasmid_genes.tsv file
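
A per-gene conversion might look like the sketch below. It assumes that each row has `gene`, `start`, `end`, and `strand` columns, that the gene identifier is the contig name followed by an underscore and a gene number, and that strand is encoded as `1` or `-1`. These are assumptions about the geNomad genes table, and the helper is hypothetical.

```python
def gene_row_to_gtf(row, source="geNomad"):
    """Format one gene row as a tab-separated GTF line (hypothetical layout)."""
    gene_id = row["gene"]
    # Assumed naming scheme: <contig>_<gene number>
    contig, _, _gene_number = gene_id.rpartition("_")
    strand = "+" if str(row["strand"]) == "1" else "-"
    fields = [
        contig,
        source,            # could distinguish virus from plasmid gene files
        "CDS",
        str(row["start"]),
        str(row["end"]),
        ".",               # no score
        strand,
        "0",               # frame
        f'gene_id "{gene_id}";',
    ]
    return "\t".join(fields)


row = {"gene": "S1_contig_7_3", "start": "10", "end": "400", "strand": "-1"}
print(gene_row_to_gtf(row))
```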

Uses the plotnine Python library v0.12.4 (newer versions will NOT work) to make a heatmap from a csv table.

usage: plotnine_heatmap.py [-h] [-a ABUND] [-t TITLE] [-d DROPCOLS] inputfile outputfile

This script uses the python library plotnine to make a heatmap from a csv table. The y-axis column should be the first one in the table after any columns are dropped.

positional arguments:
  inputfile             file with table from which to make heatmap
  outputfile            filename of output pdf

optional arguments:
  -h, --help            show this help message and exit
  -a ABUND, --abund ABUND
                        abundance value in table. used for labeling heatmap. the default is generic. (default: abund)
  -t TITLE, --title TITLE
                        Title of heatmap; double quoted
  -d DROPCOLS, --dropcols DROPCOLS
                        Columns to drop/remove before plotting - separated by commas; double quoted
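
The rule that the y-axis column is the first one left after dropping columns can be sketched without plotnine. The example below only shows the table preparation, reshaping a wide csv into long (y, sample, value) records of the kind a tile-based heatmap typically consumes. The function name and the example table are hypothetical, and the script's own reshaping may differ.

```python
import csv
import io


def wide_to_long(handle, dropcols=""):
    """Drop columns, then melt a wide csv into (y, sample, value) records."""
    # dropcols mirrors -d/--dropcols: a comma-separated string of column names
    drop = {c.strip() for c in dropcols.split(",") if c.strip()}
    reader = csv.DictReader(handle)
    keep = [c for c in reader.fieldnames if c not in drop]
    # The first remaining column labels the y-axis; the rest are samples
    ycol, samples = keep[0], keep[1:]
    return [(row[ycol], s, float(row[s])) for row in reader for s in samples]


table = io.StringIO("id,votu,S1,S2\n1,vOTU_1,5,0\n2,vOTU_2,1,2\n")
for record in wide_to_long(table, dropcols="id"):
    print(record)
```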

This script takes in a gene abundance table and prints to STDOUT a subset of the table containing the top genes, ranked by mean abundance, by prevalence and then mean abundance, or by median abundance across samples.

usage: top_genes.py [-h] [-n NTOP] [-m {mean,prev,median}] [-c IDCOL] [-i IGNORE] [-d]
                    genetab


positional arguments:
  genetab               input gene abundance table

optional arguments:
  -h, --help            show this help message and exit
  -n NTOP, --ntop NTOP  number of top genes to keep in output table. (default: 10)
  -m {mean,prev,median}, --metric {mean,prev,median}
                        how to choose top genes. by mean abundance across samples, by
                        prevalence first and then mean, or by median. (default: mean)
  -c IDCOL, --idcol IDCOL
                        column name with gene id. default will use the first column
  -i IGNORE, --ignore IGNORE
                        other columns besides idcol to ignore when choosing top genes.
                        they will still be included in the output. column names should be
                        separated by commas; double quoted
  -d, --debug           print debug messages
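
The three ranking metrics can be sketched in plain Python as follows. This is an illustration under assumptions: prevalence is taken to mean the number of samples with abundance greater than zero, the table is simplified to a dictionary of gene id to per-sample abundances, and the function name is hypothetical. The script itself may define these differently.

```python
from statistics import mean, median


def top_genes(table, ntop=10, metric="mean"):
    """Return the ids of the top genes; table maps gene id -> abundances."""

    def prevalence(values):
        # Assumed definition: number of samples where the gene is detected
        return sum(v > 0 for v in values)

    keys = {
        "mean": lambda vals: mean(vals),
        "median": lambda vals: median(vals),
        # Sort by prevalence first, break ties by mean abundance
        "prev": lambda vals: (prevalence(vals), mean(vals)),
    }
    ranked = sorted(table, key=lambda g: keys[metric](table[g]), reverse=True)
    return ranked[:ntop]


abund = {"geneA": [0, 0, 9], "geneB": [2, 2, 2], "geneC": [1, 1, 0]}
for metric in ("mean", "prev", "median"):
    print(metric, top_genes(abund, ntop=2, metric=metric))
```

With this toy table, a gene that is abundant in a single sample ranks first by mean but drops out under the prevalence and median metrics, which is the usual reason to offer more than one ranking.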