Skip to content

This script runs MASH given a set of reads against specified references.

License

Notifications You must be signed in to change notification settings

tiagofilipe12/mash_wrapper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

66 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mash_wrapper.py

About

mash_wrapper.py is a script that is intended to allow users to use mash. It provides options to run mash dist as well as mash screen with different purposes. While mash dist is used to calculate the similarities of any sequence with all the sequences in the (msh) database, mash screen is used to check if any of those sequences in database (in the msh file) might be present in the provided reads. So, basically, this script features two different ways to check the presence of query sequences both in sequences (Fasta) and in reads (FastQ).

Installation

Using pipy

  • pip3 install mash-wrapper

Using git release

Note: Please check this Troubleshooting, since there seems to exist an issue with mash screen and msh files generated in other machines.

How to run

Assembly or fasta comparison

mash_wrapper.py -rs <yourdatabase.msh> -a <your_fasta> -t <number_of_threads> -j -o <your_output_folder>

Mash screen for read samples

mash_wrapper.py -rs <yourdatabase.msh> -r <read(s)> -o <output_name> -t <number_of_threads> -j -ms -w

Troubleshooting

If for some reason the above command for Mash screen mode doesn't work.

Download reference fasta file.

And run the following command:

mash_wrapper.py -rf reference.fasta -r <read(s)> -o <output_name> -t <number_of_threads> -j -ms

Options:

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTFILE [INPUTFILE ...], --input_references INPUTFILE [INPUTFILE ...]
                        Provide the input fasta files to parse. Not required
                        for multiple comparisons between assemblies.
  -rs REF_SKETCH, --reference_sketch REF_SKETCH
                        If you have a reference sketch for references provide
                        it with this option.
  -rf REF_FASTA, --reference_fasta REF_FASTA
                        If you have a reference fasta provide it with this
                        option.
  -r READS [READS ...], --reads READS [READS ...]
                        Provide the input read files to parse. Usually fastq 1
                        or 2 files. This option is mutually with "-f".
  -f SEQUENCES [SEQUENCES ...], --sequences SEQUENCES [SEQUENCES ...]
                        Provide the input sequence files to parse. Usually
                        fasta files. This option is mutually exclusive with
                        "-r".
  -o OUTPUT_TAG, --output OUTPUT_TAG
                        Provide an output tag
  -t THREADS, --threads THREADS
                        Provide the number of threads to be used. Default: 1
  -no_rm, --no-remove   Specify if you do not want to remove the output
                        concatenated fasta.
  -j, --json            If you desire to export a json file with all
                        significant entries use this options.
  -m, --mashix          Perform a matrix of all mash distance, taking all
                        files.

MASH related options:
  -k KMER_SIZE, --kmers KMER_SIZE
                        Provide the number of k-mers to be provided to mash
                        sketch. Default: 21
  -p PVALUE, --pvalue PVALUE
                        Provide the p-value to consider a distance
                        significant. Default: 0.05.
  -md MASHDISTANCE, --mashdist MASHDISTANCE
                        Provide the maximum mash distance to be parsed to the
                        matrix. Default: 0.1.
  -ms, --mashscreen     Runs mash screen. This will prevent mash dist to run.
  -w, --winner_takes_it_all
                        Uses the winner takes it all function to remove
                        redundancy from mash screen results. NOTE: DO NOT USE
                        FOR PLASMID ID!
  -id MIN_IDENTITY, --min_identity MIN_IDENTITY
                        Provide the minimum identity for mash screen run.
                        Default is 0, reporting all values

Outputs:

  • A text file with MASH distance outputs as well as .msh files for each input read file.

  • A html file with a bar plot graphical visualization, with each input read file as a series of data.

  • A json file with keys as accession numbers and values as the results from mash.

About

This script runs MASH given a set of reads against specified references.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages