diff --git a/README.md b/README.md index 4a6a898..ca938ef 100644 --- a/README.md +++ b/README.md @@ -23,6 +23,7 @@ This package is trying to be as user friendly as possible. One of the hopes is t * Patrick Roelli ([@Hoohm)](https://github.com/Hoohm)) * Sebastian Mueller ([@seb-mueller)](https://github.com/seb-mueller)) * Charles Girardot ([@cgirardot)](https://github.com/cgirardot)) +* Tom Kelly ([@TomKellyGenetics)](https://github.com/TomKellyGenetics)) ## Usage @@ -35,7 +36,7 @@ In any case, if you use this workflow in a paper, don't forget to give credits t ### Step 2: Configure workflow -Configure the workflow according to your needs via editing the file `config.yaml` and the `samples.tsv` following those [instructions](https://github.com/Hoohm/dropSeqPipe/wiki/Create-config-files) +Configure the workflow according to your needs via editing the file `config.yaml` and the `samples.csv` following those [instructions](https://hoohm.github.io/dropSeqPipe/) ### Step 3: Execute workflow @@ -51,9 +52,9 @@ Execute the workflow locally via using `$N` cores on the `$WORKING_DIR`. Alternatively, it can be run in cluster or cloud environments (see [the docs](http://snakemake.readthedocs.io/en/stable/executable.html) for details). -If you not only want to fix the software stack but also the underlying OS, use +If you not only want to fix the software stack but also the underlying OS (using Singularity), use - snakemake --use-conda --use-singularity + snakemake --use-singularity in combination with any of the modes above. diff --git a/docs/docs/Clusters.md b/docs/docs/Clusters.md deleted file mode 100644 index c4a5a36..0000000 --- a/docs/docs/Clusters.md +++ /dev/null @@ -1,15 +0,0 @@ -Running on clusters ----------------------------------- -There is a file in the `templates` called `cluster.yaml`. This can be used to modify ressources needed for your data. I generally recommand moving the file to the root of the folder so that it doesn't get replaced by updates. - -Bellow is an example of running on a cluster using the template file `cluster.yaml` on SLURM. - -``` -snakemake --cluster 'sbatch -n {cluster.n} -t {cluster.time} --clusters=CLUSTERNAME --output={cluster.output}' --jobs N --cluster-config cluster.yaml --use-conda --local-cores C -``` - -* N: is the number of jobs you are allowed to run at the same time -* C: is the local-cores of the host machine. A few simple rules are gonna be run locally (not sent to nodes) because they are not that heavy (mostly plotting) -* CLUSTERNAME: the name of the cluster you want to use - -Note: The default path for cluster logs in the cluster.yaml is `logs/cluster/`. If that folder doesn't exist, our cluster can't write and will crash without an error message. \ No newline at end of file diff --git a/docs/docs/Create-config-files.md b/docs/docs/Create-config-files.md deleted file mode 100644 index 1cf86b8..0000000 --- a/docs/docs/Create-config-files.md +++ /dev/null @@ -1,156 +0,0 @@ -# Config file and sample file ---------------------------- - -In order to run the pipeline you will need to complete the config.yaml file and the samples.csv file. Both are located in the `templates` folder , should be moved to the root folder of the experiment and filled in for missing entries before running the pipeline. - -The goal for this is to provide the config.yaml when you finally upload the data to a repository for a publication as well as the pipeline version. This provides other users to ability to rerun the processing from scratch exactly as you did. 
This is possible because snakemake will download and create the exact same environnment for each rule using the envs files provided with the pipeline. - -## 1. config.yaml - Executables, system and experiment parameters -The config.yaml contains all the necessary parameters and paths for the pipeline. -``` -CONTACT: - email: user.name@provider.com - person: John Doe -LOCAL: - temp-directory: /tmp - memory: 4g - raw_data: - results: -META: - species: - mus_musculus: - build: 38 - release: 94 - homo_sapiens: - build: 38 - release: 91 - ratio: 0.2 - reference-directory: /path/to/references/ - gtf_biotypes: gtf_biotypes.yaml -FILTER: - barcode_whitelist: '' - 5-prime-smart-adapter: AAAAAAAAAAA - cell-barcode: - start: 1 - end: 12 - UMI-barcode: - start: 13 - end: 20 - cutadapt: - adapters-file: 'adapters.fa' - R1: - quality-filter: 20 - maximum-Ns: 0 - extra-params: '' - R2: - quality-filter: 20 - minimum-adapters-overlap: 6 - minimum-length: 15 - extra-params: '' -MAPPING: - STAR: - genomeChrBinNbits: 18 - outFilterMismatchNmax: 10 - outFilterMismatchNoverLmax: 0.3 - outFilterMismatchNoverReadLmax: 1 - outFilterMatchNmin: 0 - outFilterMatchNminOverLread: 0.66 - outFilterScoreMinOverLread: 0.66 -EXTRACTION: - LOCUS: - - CODING - - UTR - strand-strategy: SENSE - UMI-edit-distance: 1 - minimum-counts-per-UMI: 0 -``` -Please note the "space" after the colon, is needed for the yaml to work. - -## Subsections - -### [CONTACT] -* `email` and `person` This is not requested. You can provide the e-mail and name address of the person who processed the data using this configuration. Ideally you should provide the config.yaml with the data repository to allow people to rerun the data using dropSeqPipe. - -### [LOCAL] -* `temp-directory` is the temp or scratch folder with enough space to keep temporary files. -* `memory` is the maximum memory allocation pool for a Java Virtual Machine. -* `raw_data` is the folder containing all your raw fastq.gz files. -* `results` is the folder that will contain all the results of the pipeline. - -### [META] -* `species` is where you list the species of your samples. It can be a mixed experiment with two entries. -* `SPECIES_ONE` can be for example: mus_musculus, homo_sapiens, etc... It has to be the name used on ensembl for automatic download to work. -* `build` is the genome build number. -* `release` is the annotation release number. -* `SPECIES_TWO` can be your second species. -* `ratio` is how much "contamination" from another species you allow to validate them as a species or mixed. 0.2 means you allow a maximum of 20% mixing. -* `reference-directory` is where you want to store your references files. -* `gtf_biotypes` is the gtf_biotypes.yaml file containing the selection of biotypes you want to keep for your gene to read attribution. Using less biotypes may decrease your multimapping counts. - -### [FILTER] -* `barcode_whitelist` is the filename of your whitelist fi you have one. Well plate base protocols often have one. -* `5-prime-smart-adapter` is the 5" smart adapter used in your protocol. -* `cell-barcode and UMI-barcode`: Is the section for cell/umi barcode filtering. - * `start` is the first base position of your cell/umi barcode. - * `end` is the last base position of your cell/umi barcode. -* `cutadapt`: Is the section for trimming. -* `adapters-file` is the file containing your list of adapters as fasta. you can choose between 6 files in the `templates` folder, add any sequence to existing files or provide your own custom one. 
- * NexteraPE-PE.fa - * TruSeq2-PE.fa - * TruSeq2-SE.fa - * TruSeq3-PE-2.fa - * TruSeq3-PE.fa - * TruSeq3-SE.fa -Provide the path to the file you want to use for trimming. If you want to add custom sequences or create a complete new one, I would advise to store it in the ROOT folder of the experiment. This will ensure that your custom file will not be overwritten if you update the pipeline. - -Example: `NexteraPE-PE.fa` -* `R1` lists the options for read1 (cell barcode and umi) filtering/trimming - * `quality-filter` is the minimum mean score of the sliding window for quality filtering. - * `maximum-Ns` how many Ns you allow in the cell barcode and umi barcode. By default it is one because we want to be able to collapse barcodes that have one mismatch. - * `extra-params` if you usually add extra paramters to cutadapt, you can do it here. *Only for experienced cutadapt users*. -* `R2` lists the options for read2 (mRNA) filtering/trimming -that have one mismatch. - * `maximum-length` is the maximum length of your mRNA read before alignement. - * `extra-params` if you usually add extra paramters to cutadapt, you can do it here. *Only for experienced cutadapt users*. -For more information about trimming and filtering please visit the [cutadapt](https://cutadapt.readthedocs.io/en/stable/guide.html) website. - -### [MAPPING] -* `STAR` - * `genomeChrBinNbits` is a value used for index generation in STAR. The formula is min(18,int(log2(genomeLength/referenceNumber))) - * `outFilterMismatchNmax` (default:10) is the maximum number of mismatches allowed. - * `outFilterMismatchNoverLmax` (default:0.3) is the maximum ratio of mismatched bases that mapped. - * `outFilterMismatchNoverReadLmax` (default:1.0) is the maximum ratio of mismatched bases of the whole read. - * `outFilterMatchNmin` (default:0) is the minimum number of matched bases. - * `outFilterMatchNminOverLread` (default:0.66) alignment will be output only if the ratio of matched bases is higher than or equal to this value. - * `outFilterScoreMinOverLread` (default:0.66) alignment will be output only if its ratio score is higher than or equal to this value. - -All of the values for STAR are the default ones. For details about STAR parameters and what they do, please refer to the [STAR manual on git](https://github.com/alexdobin/STAR/tree/master/doc). - -### [EXTRACTION] -* `LOCUS` are the overlapping regions that reads overlap and are counted in the final expression matrix. Possible values are `CODING`, `UTR`, `INTRON` -* `UMI-edit-distance` This is the maximum manhattan distance between two UMI barcode when extracting count matrices. -* `min-count-per-umi` is the minimum UMI/Gene pair needed to be counted as one. -* `strand-strategy` `SENSE` defines that you only count genes where the forward strand mapped to the forward region on the DNA. Other possibilities are `ANTISENSE` (only count reads that mapped on the opposite strand) or `BOTH` (count all). - -# 2. samples.csv - Samples parameters -This file holds the sample names, expected cell numbers and read length for each sample. -The file has to have this format: - -``` -samples,expected_cells,read_lengths,batch -sample_name1,500,100,Batch1 -sample_name2,500,100,Batch2 -``` - -* `expected_cells` is the amount of cells you expect from your sample. -* `read_length` is the read length of the mRNA (Read2). This is necessary for STAR index generation -* `batch` is the batch of your sample. If you are added new samples to the same experiment, this is typically a good place to add the main batch. 
- -`Note:` You can add any other column you wish here, it won't affect the pipeline and you can use it later on in your analysis. - -Finally, you can now [run the pipeline](https://github.com/Hoohm/dropSeqPipe/wiki/Running-dropSeqPipe) - -or - -Create a [custom reference](https://github.com/Hoohm/dropSeqPipe/wiki/Reference-Files) - diff --git a/docs/docs/Installation.md b/docs/docs/Installation.md index 991d56b..b15976f 100644 --- a/docs/docs/Installation.md +++ b/docs/docs/Installation.md @@ -1,7 +1,11 @@ -This pipeline is dependent on conda. +#Installation -### Step 1: Download and install miniconda3 -First you need to download and install miniconda3: +DropSeqPipe is dependent upon the software management system `conda`. +First step therefore is to install `miniconda3` a small version of `Anaconda` that includes conda, Python, the packages they depend on and a small number of other useful packages such as pip and zlib. + + +## Step 1: Download and install miniconda3 +First download and install miniconda3 by executing the commands below on the command line. for linux ``` @@ -15,33 +19,53 @@ curl https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o M bash Miniconda3-latest-MacOSX-x86_64.sh ``` +The command `conda` should now be available and can be tested by running `conda` on the command line. +If the command doesn’t work, more information can be found [here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) + +## Step 2: Install snakemake -### Step 2: Clone the workflow +To run the pipeline [snakemake]( https://snakemake.readthedocs.io/en/stable/) is needed which will be installed using conda -Clone the worflow ``` -git clone https://github.com/Hoohm/dropSeqPipe.git +conda install -c bioconda -c conda-forge snakemake ``` +The command `snakemake` should now be available and can be tested by running `snakemake` on the command line. If the command doesn’t work, more detailed instruction can be found [here](https://snakemake.readthedocs.io/en/stable/) -### Step 3: Install snakemake +## Step 3: Obtain pipeline + +In order to download the actual pipeline either download it directly or use [git]( https://www.git-scm.com/) to clone the workflow. + +The pipeline can be directly downloaded [here] +(https://github.com/Hoohm/dropSeqPipe/archive/master.zip) + +It is however recommended to download the pipeline by cloning using git rather than downloading the zip archive. +If `git` hasn’t been installed yet `conda` can be used to do so. ``` -conda install -c bioconda -c conda-forge snakemake +conda install -c bioconda -c conda-forge git ``` - -Next step is config files completion -[Complete the config.yaml](https://github.com/Hoohm/dropSeqPipe/wiki/Create-config-files) with the missing information +Next the workflow can be obtained using the following command. + +``` +git clone https://github.com/Hoohm/dropSeqPipe.git +``` + +This should result in the set-up of a directory called `dropSeqPipe` which contains the pipeline. -### UPDATES: How to update the pipeline -Go to your experiment folder, then pull. + +## How to update the pipeline + +Go to the experiment folder, then pull. + ``` -git pull https://github.com/Hoohm/dropSeqPipe.git +git pull ``` -If you want to update files/plots based on the updates you can use this command: +To update files/plots, use this command: + ``` snakemake -R `snakemake --list-codes-changes` ``` -This will update all the files that would be modified by the changes in the code (rules or script). 
Depending on how much and where the changes have been made, this might rerun the whole pipeline. \ No newline at end of file
+This will update all the files that would be modified by the changes in the code (rules or script). Depending on how much and where the changes have been made, this might rerun the whole pipeline.
diff --git a/docs/docs/Output_misc.md b/docs/docs/Output_misc.md
new file mode 100644
index 0000000..1ce635d
--- /dev/null
+++ b/docs/docs/Output_misc.md
@@ -0,0 +1,29 @@
+# Other output
+
+Apart from plots, dSP also generates various other output, such as logs, tables, R objects, reports and intermediate files.
+
+## Tables
+Various summary statistics are also computed, such as `barcode_stats_pre_filter.csv` and `barcode_stats_post_filter.csv`:
+
+### barcode_stats_pre_filter.csv
+Various statistics computed over all barcodes before filtering: Total_raw_reads, Nr_barcodes_total, Nr_barcodes_more_than_1_reads, Nr_barcodes_more_than_10_reads, percentile99, percentile95, percentile50, Gini-index and Reads_assigned_to_expected_STAMPs.
+
+### barcode_stats_post_filter.csv
+
+Statistics for barcodes that were taken forward as STAMPs, as set by `expected_cells` in `config.yaml`, including:
+Total_nb_reads, Nb_STAMPS, Median_reads_per_STAMP, Mean_reads_per_STAMP, Total_nb_UMIs, Median_UMIs_per_STAMP, Mean_UMIs_per_STAMP, Mean_UMIs_per_Gene, Median_number_genes_per_STAMP, Mean_number_genes_per_STAMP, Mean_Ribo_pct, Mean_Mito_pct, Mean_Count_per_UMI, Read_length, Number_barcodes_used_for_debug, Pct_reads_after_filter_expected_cells and Pct_reads_after_filter_everything.
+
+## R objects
+
+### Seurat
+
+A Seurat object is generated automatically, can be used straight away with Seurat and can be found here:
+`summary/R_Seurat_objects.rdata`
+Note: dSP up to version 0.5 produces Seurat v2 objects; newer versions (dSP 0.6+) generate Seurat v3 objects.
+
+From dSP version 0.6 on, the object can be imported as follows:
+
+```
+library(Seurat)
+dspData <- readRDS("path/to/working_dir/results/summary/R_Seurat_objects.rdata")
+```
diff --git a/docs/docs/Plots.md b/docs/docs/Plots.md
index 6778376..278e65b 100644
--- a/docs/docs/Plots.md
+++ b/docs/docs/Plots.md
@@ -1,97 +1,103 @@
-On of the main purpose of this package is getting information about your data to improve your protocol and filter your data for further downstream analysis.
+One of the main purposes of this package is getting information about the data to improve the protocol and filter the data for further downstream analysis. Below is a list of plots and reports that will be generated by the pipeline.
-
-Here is a list of plots and reports that you will get from the pipeline.
+## 1. QC reports
 Fastqc, STAR and cutadapt reports are generated as [multiqc reports](http://multiqc.info/docs/#using-multiqc-reports) in the reports folder.
-## 1. Adapter content
+## 2. Adapter content
 ![Adapter content](images/adapter_content.pdf)
-On the x axis are the samples.
-On the y axis are the percentages of total adapters that have been found (and trimmed) in respective fastq files based on the `adapter-file` provided via `config.yaml`.
+This plot provides an idea of which adapter has been found, and in which proportion, in each sample. The top plot shows results for read 1, the bottom one for read 2. Samples are on the x axis, and the y axis shows the percentages of total adapters that have been found (and trimmed) in the respective fastq files based on the `adapter-file` provided via `config.yaml`.
+ +Below is a list of adapters that can be trimmed off: + +Illumina specific adapters: +*Illumina_Universal +*PrefixNX/1 +*Trans1 +*Trans1_rc +*Trans2 +*Trans2_rc +*Nextera + +Transcript specific adapters +*polyA +*polyT +*polyC +*polyG -The top plot is for read1 and the bottom for read2. +Drop-seq specific adapters +*drop-seq -This plot provides an idea of the which adapter has been found and in which proportion in each sample. -## 2. Yield (across samples) +## 3. Yield (across samples) ![Yield](images/yield.png) -On the x axis are the samples. +On the x axis are the samples. TOP: On the y axis are the number of reads attributed to each category. BOTTOM: On the y axis are the percentage of attributed to each category. -This plot gives you an overview of all the reads from your samples and how they are distributed in all the possible categories. The reads that are uniquely mapped ar the ones you will keep at the end for the UMI count matrix. +This plot gives an overview of the read distribution. Uniquely mapped reads are kept for the UMI count matrix. -## 3. Knee plot (per sample) +## 4. Knee plot ![Knee plot](images/sample1_knee_plot.png) -On the x axis is the cumulative fraction of reads per STAMPS (captured cell). +On the x axis is the cumulative fraction of reads per STAMPS (captured cells). On the y axis is the ordered STAMPS (based on total reads). -This allows you to determine how much of the reads you actually captured with the number of cells you expected. -The cutting is based on the `expected_cells` parameter in the `samples.csv` file. +This allows to determine number of cells captured and the amount of reads that assign to cells rather than background reads. The cutting is based on the `expected_cells` parameter in the `samples.csv` file. The green `selected cells` are the cells that are going to be in the final expression matrix. -If you see a clear bend on the plot that is higher in the number of cells than what you expected, you should increase the `expected_cells` value and rerun the `extract` step. If it is under, I would advise to filter out your data with a downstream analysis tool such as Seurat. -*Note: I advise not to try to discover "real" cells/STAMPS at this stage. I suggest to extract the expected number of cells and filter out later in post-processing with other kind of meta data.* +If the bend on the plot is higher at a higher number of cells than what was expected increase the `expected_cells` value and rerun the `extract` step. If it is under, filter data with a downstream analysis tool such as Seurat. -## 4. RNA metrics (per sample) +## 5. RNA metrics (per sample) ![RNA metrics](images/sample1_rna_metrics.png) -On the x axis are top barcodes based on your `expected_cells` values or the `barcodes.csv` file. -Top plot: On the y axis are the number of bases classified by region of mapping. +On the x axis are top barcodes based on the `expected_cells` values or the `barcodes.csv` file. +Top plot: On the y axis are the number of bases classified by region of mapping. Bottom plot: On the y axis are the percentage of bases classified by region of mapping. -This plot gives a lot of different informations. The top plot allows you to quickly compare cells between them in terms of how much has been mapped. This can sometimes help identify outliers or bad runs. -The bottom plot allows you to find cells that have an "abnormal" mapped base distribution compared to other cells. 
- +The top plot allows for quick comparison of mapping rates between cells, to identify outliers or bad runs. The bottom plot allows to find cells that have an "abnormal" mapped base distribution compared to other cells. -## 5. Violine plots for barcode properties (across samples) +## 6. Violine plots for barcode properties (across samples) ![Violine plots](images/mac_violinplots_comparison_UMI.png) Various statistic for barcodes that were taken forward as STAMPs as set as `expected_cells` in `config.yaml`. Each point represents a barcode augmented by a violine-plot density estimator of barcode distribution along the y-axis. -On the x axis are the samples for each panel (Note: the dot distribution along the x-axis does't not bear information, it's just a visual aid to better assess density). -On the y axis are the respecitve statistics described below for each panel. +On the x-axis are the samples for each panel (Note: the dot distribution along the x-axis doesn’t bear information, it's just a visual aid to better assess density). +On the y axis are the respective statistics described below for each panel. -TOP panel from left to right: +TOP panel from left to right: - nUMI: number of UMI per barcode - nCounts: number of Counts per barcode -- top50: fraction (percentage/100) of the highest expressed genes compared to entire set of genes. +- top50: fraction (percentage/100) of the highest expressed genes compared to entire set of genes. -BOTTOM: +BOTTOM: - nUMI: average number of UMI per Gene per barcode - pct.Ribo: Fraction of ribosomal RNA (Note: ribsomal transcripts defined as starting with "^Rpl") - pct.mito: Fraction of mitochondrial RNA (Note: mitchondrial transcripts defined as starting with "^mt-") -## 6. Saturation plot: UMI per barcode (across samples) +## 7. Saturation plot: UMI per barcode (across samples) ![umi per barcode](images/mac_UMI_vs_gene.png) -Number of UMI (x-axis) vs number of Genes (y-axis) for each barcode (points in plot) broken down by sample (different colors). -Number of Genes defined as Genes having at least 1 read mapped to them. -Individual samples are color-coded. A loess regression curve of barcodes for each sample is fitted. -Various statistic for barcodes that were taken forward as STAMPs as set as `expected_cells` in `config.yaml`. +Number of UMIs (x-axis) vs number of Genes (y-axis) for each barcode (points in plot) broken down by sample. Number of Genes defined as Genes having at least 1 read mapped to them. Individual samples are color-coded. A loess regression curve of barcodes for each sample is fitted. -This plot can indicate how many counts per barcode are required on average to find all expressed genes in a cell. -Given enought coverage, it can also indicate how many genes are expressed for the examined cell type. +This plot can indicate how many counts per barcode are required on average to find all expressed genes in a cell. Given enough coverage, it can also indicate how many genes are expressed for the examined cell type. -## 7. Saturation plot: Counts per barcode (across samples) +## 8. Saturation plot: Counts per barcode (across samples) ![counts per barcode](images/mac_Count_vs_gene.png) -Number of Counts (x-axis) vs number of Genes (y-axis) for each barcode (points in plot) broken down by sample (different colors). -Number of Genes defined as Genes having at least 1 read mapped to them. -Individual samples are color-coded. A loess regression curve of barcodes for each sample is fitted. 
+Number of Counts (x-axis) vs number of Genes (y-axis) for each barcode (points in plot) broken down by sample (different colors). +Number of Genes defined as genes having at least 1 read mapped to them. +Individual samples are color-coded. A loess regression curve of barcodes for each sample is fitted. Various statistic for barcodes that were taken forward as STAMPs as set as `expected_cells` in `config.yaml`. -## 8. Counts per UMI per barcode (across samples) +## 9. Counts per UMI per barcode (across samples) ![counts per UMI](images/mac_UMI_vs_counts.png) -Number of UMI (x-axis) vs number of Counts (y-axis) for each barcode (points in plot) broken down by sample (different colors). -Individual samples are color-coded. A loess regression curve of barcodes for each sample is fitted. -Black line indicate an optimal 1:1 ratio between UMI and Counts (i.e. no Duplicates!) - -This plots can give an indication on the level of duplication for each sample. The close to black line the lower duplication. +Number of UMI (x-axis) vs number of Counts (y-axis) for each barcode (points in plot) broken down by sample (different colors). +Individual samples are color-coded. A loess regression curve of barcodes for each sample is fitted. +Black line indicates an optimal 1:1 ratio between UMI and Counts -# Mixed experiment +These plots can give an indication on the level of duplication for each sample. The closer a dataset aligns to the black line the lower the duplication rate. -## 9. Barnyard plot (per sample) +## 10. Barnyard plot (mixed experiment only) ![Barnyard plot](images/hum_mus_species_plot_transcripts.png) -This plot shows you species purity for each STAMPS. Mixed and No call STAMPS are dropped and only single species are kept for extraction. -You can change the minimum ratio of transcripts to define a STAMP as mixed or not in the configfile with: `species_ratio` -You get one plot for genes and one plot for transcripts. The selection is done on the transcript level. +Only produces in [mixed species experiment](Setting-up-an-experiment.md) +This plot shows species purity for each sample either based on transcripts or genes. +Mixed and No call STAMPS are dropped and only single species are kept for extraction. The selection is done on the transcript level. +The minimum ratio of transcripts can be altered in `config.yaml`under `species_ratio` to define a STAMP as mixed. diff --git a/docs/docs/Reference-Files.md b/docs/docs/Reference-Files.md deleted file mode 100644 index 8ad4003..0000000 --- a/docs/docs/Reference-Files.md +++ /dev/null @@ -1,50 +0,0 @@ -Reference files ------------------ -From version 0.4 on, reference files are automatically downloaded by the pipeline. Mixed references are also downloaded and merged automatically. Since sometimes you still want to use your own reference you can bypass the download by creating your own `genome.fa` and `annotation.gtf` file. - -Snakemake generates file based on paths. If you want to use a custom reference you have to name it properly for snakemake to find it. 
- -Here is an example: - -Let's assume this is you configuration for the META section: -``` -META: - species: - funky_species_name: - build: A - release: 1 - ratio: 0.2 - reference-directory: /absolute/path/to/references - gtf_biotypes: gtf_biotypes.yaml -``` - -You need to provide the following files - -``` -/absolute/path/to/references/funky_species_name_A_1/genome.fa -/absolute/path/to/references/funky_species_name_A_1/annotation.gtf -``` - -This will stop dropSeqPipe from downloading a new reference. - - -Once the pipeline has run completely, the folder will look like this: - -``` -genome.fa -annotation.gtf -annotation.refFlat -annotation_reduced.gtf -genome.consensus_introns.intervals -genome.dict -genome.exons.intervals -genome.genes.intervals -genome.intergenic.intervals -genome.rRNA.intervals -STAR_INDEX/SA_read_length/ -``` - -Note: The STAR index will be built based on the read length of your mRNA read (Read2). -If you have different lengths, it will produce multiple indexes. - -Finally, you can now [run the pipeline](https://github.com/Hoohm/dropSeqPipe/wiki/Running-dropSeqPipe) \ No newline at end of file diff --git a/docs/docs/Running-dropSeqPipe.md b/docs/docs/Running-dropSeqPipe.md index 448662b..943d70a 100644 --- a/docs/docs/Running-dropSeqPipe.md +++ b/docs/docs/Running-dropSeqPipe.md @@ -1,106 +1,84 @@ -Example ------------------------ -The pipeline is to be cloned once and then run on any folder containing the configuration files and your raw data. The workingdir folder can contain multiple runs (aka batches) as you can easily add new samples when recieving new data and run the same commands. This will simply run the pipeline on the newly added data and recreate reports as well as plots containing all the samples. - -Example: You run 2 biological conditions with 2 replicates. This makes up for 4 samples. Assume a simple dropseq protocol with only human cells. -1. You sequence the data and recieve the 8 files (two files per sample) and download the pipeline -2. You run the pipeline with the command: `snakemake --use-conda --cores N --directory WORKING_DIR`. `N` being the number of cores available and `WORKING_DIR` being the folder containing your `config.yaml`, `samples.csv`, adapter file and `gtf_biotypes.yaml`. -3. You see that there is an issue with the protocol and you modify it -4. You create a new set of libraries and sequence them (same 2x2 design) -5. You add the new files in the data folder of `WORKING_DIR` and edit the samples.csv to add missing samples. -6. You run the pipeline as you did the first time `snakemake --use-conda --cores N --directory WORKING_DIR` -7. This will run the new samples only and recreate the reports as well as the yield plots. -8. It is now easy to compare the impact of your change in the procotol - -Working dir folder preparation ----------------- -The raw data from the sequencer should be stored in the `RAW_DATA` folder of `WORKING_DIR` folder like this: -``` -/path/to/your/WORKING_DIR/ -| -- RAW_DATA/ -| -- -- sample1_R1.fastq.gz -| -- -- sample1_R2.fastq.gz -| -- -- sample2_R1.fastq.gz -| -- -- sample2_R2.fastq.gz -| samples.csv -| config.yaml -| barcodes.csv -| adapters.fa -``` -*Note: In DropSeq or ScrbSeq you expect a paired sequencing. R1 will hold the information of your barcode and UMI, R2 will hold the 3' end of the captured mRNA.* +# Running the pipeline +Once everything is in place, the pipeline can be run using the snakemake command plus appropriate parameter. 
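For example, a basic full run (a sketch using the same placeholders as in the rest of this documentation: `N` for the number of available cores, `WORKING_DIR` for the folder containing `config.yaml` and `samples.csv`) could look like this:

```
# run the complete pipeline from the dropSeqPipe directory (sketch)
snakemake --use-conda --cores N --directory WORKING_DIR
```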
+It is recommended to take a [look at the options](http://snakemake.readthedocs.io/en/latest/) that are available since only a few of them will be covered below. -Once everything is in place, you can run the pipeline using the normal snakemake commands. +## The pipeline modes -Running the pipeline (TLDR version) ----------------------------- +* `meta`: Downloads and generates all the subsequent references files and STAR index needed to run the pipeline. Run this alone to just create the meta-data file before running a new set of data. +* `qc`: Creates fastqc reports of the data. +* `filter`: Generates `sample_filtered.fastq.gz` files from `sample_R1.fastq.gz` files which are ready to be mapped to the genome. This step filters out data for low quality reads and trims the adapters provided in the FILTER section. +* `map`: Mapping the reads (filtered fastq) to the reference. Goes from sample_filtered.fastq.gz to `sample_final.bam`. +* `extract`: Extracts the expression data such as a UMI and a count expression matrix. -For a simple single cell run you only need to run: `snakemake --cores N --use-conda --directory WORKING_DIR` -This will run the whole pipeline and use the X number of cores you gave to it. +There are two main ways to run the pipeline. +Either using a single command line that will run the whole pipeline or run each mode separately. +The second approach is especially recommended when working on a new protocol. +Since v`0.4` the pipeline detects mixed experiments if two instead of one reference is specified in `config.yaml`. +Note, the stepwise approach is not available for mixed experiments. +For both ways, one needs to first navigate to the dSP install directory (which contains the `Snakefile`). -Running the pipeline ---------------------------------- +## Running all modes in a single command +Run `snakemake all --use-conda --directory WORKING_DIR` which will run everything without stopping. -I highly recommend to take a [look at the options](http://snakemake.readthedocs.io/en/latest/) that are available since I won't cover everything here. +## Running each mode separately +The main advantage of the second way is that parameters can be fine tune based on the results of individual steps such as fastqc, filtering and mapping quality. -Modes ------------------------------- -You have two main ways to run the pipeline. +There are seven different modes available and call the mode to run it specifically. -You can either just run `snakemake --use-conda --directory WORKING_DIR` in the root folder containing your experiment and it will run everything without stopping. +Example: +To run the `qc` mode: +``` +snakemake --cores 8 qc --use-conda --directory WORKING_DIR +``` +To run multiple modes at the same time: +``` +snakemake --cores 8 qc filter --use-conda --directory WORKING_DIR +``` -You can also run each step separately. The main advantage of the second way is that you are able to fine tune your parameters based on the results of fastqc, filtering, mapping quality, etc... -I would suggest using the second approach when you work on a new protocol and the first one when you are confident of your parameters. -There are seven different modes available and to run one specifically you need to call the mode. 
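For a complete stepwise run, the modes listed above can also be chained one after the other; a hypothetical sequence (same placeholders and options as in the examples above) might be:

```
# sketch: run the modes one group at a time instead of all at once
snakemake --cores 8 meta --use-conda --directory WORKING_DIR
snakemake --cores 8 qc filter --use-conda --directory WORKING_DIR
snakemake --cores 8 map extract --use-conda --directory WORKING_DIR
```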
+## Useful snakemake parameters
-Example: To run the `qc` mode: `snakemake --cores 8 qc --use-conda --directory WORKING_DIR`
-You can also run multiple modes at the same time if you want: `snakemake --cores 8 qc filter --use-conda --directory WORKING_DIR`
+* `--cores N` Use this argument to run on N of the available cores.
+* `--notemp` Use this command to keep all temporary files. Without this option, only files between steps are kept. This option is useful for troubleshooting the pipeline or for analyzing intermediate files.
+* `--dryrun` or `-n` Use this to perform a dry run of the command. This is useful to check for potential missing files.
+* `-p` Print out the shell commands that will be executed.
+* `--verbose` Print debugging output.
-### Single species:
-* `meta`: Downloads and generates all the subsequent references files and STAR index needed to run the pipeline. You can run this alone if you just want to create the meta-data file before running a new set of data.
-* `qc`: Creates fastqc reports of your data.
-* `filter`: Go from sample_R1.fastq.gz to sample_filtered.fastq.gz ready to be mapped to the genome. This step filters out your data for low quality reads and trims adapter you provided in the FILTER section.
-* `map`: Go from sample_filtered.fastq.gz to the sample_final.bam read to extract the expression data. This maps the data to the genom.
-* `extract`: Extract the expression data. You'll get a umi and a count expression matrix from your whole experiment.
-### Mixed species
-Since v`0.4` the pipeline detects mixed experiments on the fly. Simply run `snakemake --directory WORKING_DIR --use-conda`. The stepwise approach is not available for mixed experiments.
+## Software environments
-Barcode whitelist
---------------------
-In protocols such as SCRBseq, the expected barcodes sequences are known. This pipeline also does allow the use of known barcodes instread of a number of expected cells.
-In order to use this functionnality you just need to add a whitelist barcode file and provide the name of the file in the configuration in the section:
+### conda
-```
-FILTER:
- barcode_whitelist: name_of_your_whitelist_file
-```
-The file should be in the WORKING_DIR. Run the pipeline as usual.
+If `--use-conda` is specified, snakemake downloads all necessary software into the working directory (into the hidden folder `.snakemake`).
-Advanced options
-------------------
-If you have some specific adapters that are not present by default in the ones in the `templates` folder, you can add whatever adapters you want to trim (as many as you need) following the fasta syntax.
+By default, the .snakemake directory is created for each working directory (i.e. experiment), consequently downloading the entire software environments for each experiment. To avoid this, a central conda location can be specified with `--conda-prefix dir`, e.g.:
-```
-FILTER:
- cutadapt:
- adapters-file: name_of_your_adapter_file.fa
-```
+
+`snakemake --use-conda --conda-prefix ~/.conda/myevns`
+
+We highly recommend this step, as do [others](https://bioinformatics.stackexchange.com/questions/6914/running-snakemake-in-one-single-conda-env).
+
+*Note: Problems are likely when using `--use-conda` while running the pipeline from within an active conda environment.
+Deactivate `conda` beforehand; if in doubt use `conda deactivate`.*
-Further options
---------------------
-* `--cores N` Use this argument to use X amunt of cores available.
-* `--notemp` Use this to not delete all the temporary files.
Without this option, only files between steps are kept. Use this option if you are troobleshooting the pipeline or you want to analyze in between files yourself. -* `--dryrun` or `-n` Use this to check out what is going to run if you run your command. This is nice to check for potential missing files. +### Singularity +Since version 0.5 Singularity is supported. A container with all software is now provided and automatically used if `--use-singularity` is used. +Again, this can be shared with other datasets using shared directory `--singularity-prefix ~/.singularity/` -Folder Structure ------------------------ -This is the folder structure you get in the end: +Note: This is still immature, and some errors might be solved by specifying the reference-dir as an argument like so: +`--singularity-args="--bind path/to/reference-dir"` + +The collection of container can be found here: + +https://singularity-hub.org/collections/3535 + +## Folder Structure + +This is the folder structure in the end: ``` /path/to/your/WORKING_DIR/ | -- RAW_DATA/ @@ -118,14 +96,35 @@ This is the folder structure you get in the end: | .snakemake/ ``` -* `RAW_DATA/` Contains all your samples as well as the intermediary files +* `RAW_DATA/` Contains all the samples as well as the intermediary files * `RESULT_DIR/logs/` Contains all the logfiles generated by the pipeline * `RESULT_DIR/logs/cluster` Contains all the logfiles generated by the cluster * `RESULT_DIR/plots/` Contains all the plots generated by the pipeline * `RESULT_DIR/reports/` Contains all the reports generated by the pipeline -* `RESULT_DIR/summary/` Contains all the files you might use for downstream analysis (contains barcodes selected per sample per species, final umi/counts expression matrix) -* `RESULT_DIR/samples/` Contains all the sample specific files. Bam files, barcodes used, single sample expression files, etc... +* `RESULT_DIR/summary/` Contains all the files which might be used for downstream analysis (barcodes selected per sample and per species, final umi/counts expression matrix) +* `RESULT_DIR/samples/` Contains all the sample specific files, such as bam files, barcodes used and single sample expression files. * `samples.csv` File containing sample details * `config.yaml` File containing pipeline parameters as well as system parameters -* `adapters.fa` File containing all the adapters you wish to trim from the raw data. -* `.snakemake/` Folder that contains all the environements created for the run as well as a lot of other things. \ No newline at end of file +* `adapters.fa` File containing all the adapters for trimming from raw data. +* `.snakemake/` Folder that contains amongst other things all the environments created for the run. + +## testrun + +A minimal test dataset for testing is provided with the pipeline but needs to be initialized first (in dSP directory): + +``` +git submodule update --init --recursive +``` + +Sets up test data in the `.test` dir. The pipeline can be tested by this minimal command: + +``` +snakemake --dir .test +``` + +or adding some sensible options: + +``` +snakemake --keep-going --use-conda -rp --conda-prefix ~/.conda/myevns --directory .tes +t +``` diff --git a/docs/docs/Setting-up-an-experiment.md b/docs/docs/Setting-up-an-experiment.md new file mode 100644 index 0000000..9cb1fb9 --- /dev/null +++ b/docs/docs/Setting-up-an-experiment.md @@ -0,0 +1,295 @@ +# Setting up an experiment + +The pipeline needs to be installed as shown previously. 
+In order to run the pipeline, a separate folder (working directory) needs to be created containing the raw sequencing data as well as all required configuration files.
+dSP is then run individually on each `workingdir` which contains the sample files.
+The `workingdir` folder can contain multiple runs (batches), allowing new samples to be added and analysed easily.
+Adding new samples will only run the pipeline on the newly added data and recreate the reports as well as the plots containing all the samples.
+This is handled automatically by snakemake and avoids unnecessary recomputation of results that already exist.
+
+Example: Suppose 2 biological conditions are run with 2 replicates each, making 4 samples, using a simple Drop-seq protocol with only human cells.
+1. Sequence the data, download the respective 8 data files (2 files per sample) and install the pipeline.
+2. Run the pipeline with the command: `snakemake --use-conda --cores N --directory WORKING_DIR`, `N` being the number of cores available and `WORKING_DIR` being the folder containing the `config.yaml`, `samples.csv`, adapter file and `gtf_biotypes.yaml`.
+3. If new samples have to be added, create a new set of libraries and sequence them (same 2x2 design).
+4. Add the new files to the data folder of `WORKING_DIR` and edit the `samples.csv` to add the missing samples.
+5. Run the pipeline as the first time: `snakemake --use-conda --cores N --directory WORKING_DIR`.
+6. This will run the new samples only and recreate the reports as well as the yield plots.
+7. It is now easy to compare the impact of the changes in the protocol.
+
+## Working dir folder preparation
+
+The raw data from the sequencer are stored in the `RAW_DATA` folder of the `WORKING_DIR`, which should be structured as shown below:
+
+```
+/path/to/your/WORKING_DIR/
+| -- RAW_DATA/
+| -- -- sample1_R1.fastq.gz
+| -- -- sample1_R2.fastq.gz
+| -- -- sample2_R1.fastq.gz
+| -- -- sample2_R2.fastq.gz
+| samples.csv
+| config.yaml
+| gtf_biotypes.yaml
+|
+| custom_adapters.fa
+```
+In addition to the `RAW_DATA` folder, the `WORKING_DIR` should contain the `samples.csv`, `config.yaml`, `gtf_biotypes.yaml` and `custom_adapters.fa` files, which are explained in more detail in the following chapters.
+
+
+*Note: Drop-seq or SCRB-seq runs contain two files per sample as these methods use paired-end sequencing.
+R1 holds the cell barcode and UMI, R2 holds the 3' end of the captured mRNA.*
+
+Optional: If a barcode whitelist is available, as in methods such as SCRB-seq, add `barcodes.csv` to the working directory.
+
+
+
+## Config file and sample file
+
+In order to run the pipeline, complete the `config.yaml` file and the `samples.csv` file.
+Templates for both are located in the `templates` folder; they can be copied to the `WORKING_DIR` of the experiment and the missing entries filled in before running the pipeline.
+
+*Note: Ideally the pipeline version and the config files, specifically `config.yaml` and `samples.csv`, should be provided together with the data uploaded to a repository for a publication.
+This gives other users the ability to rerun the processing from scratch with exactly the same parameters.
+This is possible because `snakemake` will download and create the exact same conda software environment for each rule (if the `--use-conda` parameter is used), based on the `envs` files (conda environment yaml files provided with the pipeline, i.e. the yaml files contained in the `envs` directory).*
+
+### 1. 
config.yaml +The config.yaml contains all the necessary parameters and paths for the pipeline. + +Example + +``` +CONTACT: + email: user.name@provider.com + person: John Doe +LOCAL: + temp-directory: /tmp + memory: 4g + raw_data: + results: +META: + species: + mus_musculus: + build: 38 + release: 94 + homo_sapiens: + build: 38 + release: 91 + ratio: 0.2 + reference-directory: /path/to/references/ + gtf_biotypes: gtf_biotypes.yaml +FILTER: + barcode_whitelist: '' + 5-prime-smart-adapter: AAAAAAAAAAA + cell-barcode: + start: 1 + end: 12 + UMI-barcode: + start: 13 + end: 20 + cutadapt: + adapters-file: 'adapters.fa' + R1: + quality-filter: 20 + maximum-Ns: 0 + extra-params: '' + R2: + quality-filter: 20 + minimum-adapters-overlap: 6 + minimum-length: 15 + extra-params: '' +MAPPING: + STAR: + genomeChrBinNbits: 18 + outFilterMismatchNmax: 10 + outFilterMismatchNoverLmax: 0.3 + outFilterMismatchNoverReadLmax: 1 + outFilterMatchNmin: 0 + outFilterMatchNminOverLread: 0.66 + outFilterScoreMinOverLread: 0.66 +EXTRACTION: + LOCUS: + - CODING + - UTR + strand-strategy: SENSE + UMI-edit-distance: 1 + minimum-counts-per-UMI: 0 +``` + *Note: The "space" after the colon, is needed for `config.yaml` to work. + +#### Subsections + +[CONTACT] +* `email` and `person`: Provide the name and e-mail address of the person who processed the data using this configuration, this however is not mandatory. + +[LOCAL] +* `temp-directory` is the temp or scratch folder with enough space to keep temporary files. +* `memory` is the maximum memory allocation pool for a Java Virtual Machine. +* `raw_data` is the folder containing all the raw fastq.gz files. +* `results` is the folder that will contain all the results of the pipeline. + +[META] +* `species` is where the list the species reference genomes of the samples. It can be one species or two for a mixed species experiment. +* `SPECIES_ONE` can be for example: mus_musculus, homo_sapiens, etc... It has to be the name used on ensembl for automatic download to work. +* `build` is the genome build number. +* `release` is the annotation release number. +* `SPECIES_TWO` can be the second species. +* `ratio` is how much "contamination" from another species is allowed to validate them as a species or mixed. 0.2 means a maximum of 20% mixing is allowed. +* `reference-directory` is where the references files are stored. +* `gtf_biotypes` is the `gtf_biotypes.yaml` file containing the selection of biotypes for the gene to read attribution. Using less biotypes may decrease the multimapping counts. + +[FILTER] +* `barcode_whitelist` is the filename of the barcode whitelist. +* `5-prime-smart-adapter` is the 5-prime smart adapter used in the protocol. +* `cell-barcode and UMI-barcode`: Is the section for cell/UMI barcode filtering. + * `start` is the first base position of the cell/UMI barcode. + * `end` is the last base position of the cell/UMI barcode. +* `cutadapt`: Is the section for trimming. +* `adapters-file` path to the file containing a list of adapters as fasta to use for trimming. See section below for further information. + +Example: `NexteraPE-PE.fa` +* `R1` lists the options for read1 (cell barcode and UMI) filtering/trimming + * `quality-filter` is the minimum mean score of the sliding window for quality filtering. + * `maximum-Ns` how many Ns (N stands for unknown base - instead of A,T,C or G) are allowed in the cell barcode and UMI barcode. By default it is one because to be able to collapse barcodes that have one mismatch. 
+ * `extra-params` extra parameters to cutadapt can be added here *Only for experienced cutadapt users*. +* `R2` lists the options for read2 (mRNA) filtering/trimming +that have one mismatch. + * `maximum-length` is the maximum length of the mRNA read before alignment. + * `extra-params` extra parameters to cutadapt can be added here *Only for experienced cutadapt users*. +For more information about trimming and filtering please visit the [cutadapt](https://cutadapt.readthedocs.io/en/stable/guide.html) website. + +[MAPPING] +* `STAR` + * `genomeChrBinNbits` is a value used for index generation in STAR. The formula is min(18,int(log2(genomeLength/referenceNumber))) + * `outFilterMismatchNmax` (default:10) is the maximum number of mismatches allowed. + * `outFilterMismatchNoverLmax` (default:0.3) is the maximum ratio of mismatched bases that mapped. + * `outFilterMismatchNoverReadLmax` (default:1.0) is the maximum ratio of mismatched bases of the whole read. + * `outFilterMatchNmin` (default:0) is the minimum number of matched bases. + * `outFilterMatchNminOverLread` (default:0.66) alignment will be output only if the ratio of matched bases is higher than or equal to this value. + * `outFilterScoreMinOverLread` (default:0.66) alignment will be output only if its ratio score is higher than or equal to this value. + +All of the values for STAR are the default ones. For details about STAR parameters and what they do, please refer to the [STAR manual on git](https://github.com/alexdobin/STAR/tree/master/doc). + +[EXTRACTION] +* `LOCUS` are the regions where reads overlap and are counted in the final expression matrix. Possible values are `CODING`, `UTR`, `INTRON` +* `UMI-edit-distance` This is the maximum manhattan distance between two UMI barcode when extracting count matrices. +* `min-count-per-umi` is the minimum UMI/Gene pair needed to be counted as one. +* `strand-strategy` Given the factor `SENSE` this parameter defines that only genes are counted where the forward strand mapped to the forward region on the DNA. Other possibilities are `ANTISENSE` (only counts reads that mapped on the opposite strand) or `BOTH` (count all). + +### 2. samples.csv - Samples parameters +This file holds the sample names, expected cell numbers and read length for each sample. +The file has to have this format: + +``` +samples,expected_cells,read_lengths,batch +sample_name1,500,100,Batch1 +sample_name2,500,100,Batch2 +``` + +* `expected_cells` is the number of cells expected from a sample. +* `read_length` is the read length of the mRNA (Read2). This is necessary for STAR index generation +* `batch` Can be used to group samples into batches. If new samples are added to the same experiment, a new batch should be assigned to those samples. + +`Note:` Any other column can be added here, it won't affect the pipeline and it can be used later on in the analysis. + +### 3. gtf_biotypes.yaml + +TODO + +### 4. custom_adapters.fa + +File containing the list of adapters as fasta. The files can be chosen in the `templates` folder and any sequence can be added to existing files or custom files can be provided. + * NexteraPE-PE.fa + * TruSeq2-PE.fa + * TruSeq2-SE.fa + * TruSeq3-PE-2.fa + * TruSeq3-PE.fa + * TruSeq3-SE.fa + +If custom sequences are added or new files are created, ist recommended to to store them in the `WORKING_DIR` folder of the experiment. This will ensure that the custom file will not be overwritten if the pipeline is updated. + +### 5. 
Barcode whitelist [optional]
+
+In protocols such as SCRB-seq, the expected barcode sequences are known beforehand.
+This pipeline does allow the use of known barcodes instead of a number of expected cells.
+In order to use this functionality, add a whitelist barcode file and provide the name of the file in the configuration in the section:
+
+```
+FILTER:
+    barcode_whitelist: name_of_your_whitelist_file
+```
+The file should be in the WORKING_DIR. Run the pipeline as usual.
+
+
+### 6. cluster.yaml - Running on clusters [optional]
+
+There is a file in the `templates` folder called `cluster.yaml`.
+This can be used to modify the resources needed for the data.
+It is recommended to move the file to the root of the folder so that it doesn't get replaced by updates.
+
+Below is an example of running on a cluster using the template file `cluster.yaml` for the SLURM Workload Manager used by many of the world's supercomputers and computer clusters (https://slurm.schedmd.com).
+
+```
+snakemake --cluster 'sbatch -n {cluster.n} -t {cluster.time} --clusters=CLUSTERNAME --output={cluster.output}' --jobs N --cluster-config cluster.yaml --use-conda --local-cores C
+```
+
+* N: the number of jobs that are allowed to run at the same time
+* C: the number of local cores used on the host machine. A few simple rules are going to be run locally (not sent to nodes) because they are not that heavy (mostly plotting)
+* CLUSTERNAME: the name of the cluster that is used.
+
+Note: The default path for cluster logs in the `cluster.yaml` is `logs/cluster/`. If that folder doesn't exist, the cluster can't write its logs and will crash without an error message.
+
+## Choosing the mapping reference
+
+The mapping reference is usually a genome to which the reads are aligned, the same as for bulk RNA-Seq.
+A list of available genomes can be found on ENSEMBL (e.g. release 98):
+http://ftp.ensembl.org/pub/release-98/gtf/
+
+From version 0.4 on, reference files are automatically downloaded by the pipeline into a directory specified in the `config.yaml` (`reference-directory: path`).
+Mixed references are also downloaded and merged automatically based on the information in the `META` section of `config.yaml` (i.e. if two organisms rather than one are listed in the `species:` section, it will assume it is a mixed species experiment).
+
+To use a custom reference and bypass the reference download, create a custom `genome.fa` and `annotation.gtf` file.
+Snakemake generates files based on paths.
+If a custom reference is used, name it as shown below to enable snakemake to find it (see also the sketch at the end of this section).
+
+Here is an example:
+
+Let's assume this is the configuration for the `META` section of `config.yaml`:
+```
+META:
+    species:
+        funky_species_name:
+            build: A
+            release: 1
+    ratio: 0.2
+    reference-directory: /absolute/path/to/references
+    gtf_biotypes: gtf_biotypes.yaml
+```
+
+Provide the following files:
+
+```
+/absolute/path/to/references/funky_species_name_A_1/genome.fa
+/absolute/path/to/references/funky_species_name_A_1/annotation.gtf
+```
+
+This will stop dropSeqPipe from downloading a new reference.
+
+
+Once the pipeline has run completely, the `reference-directory` will look like this:
+
+```
+genome.fa
+annotation.gtf
+annotation.refFlat
+annotation_reduced.gtf
+genome.consensus_introns.intervals
+genome.dict
+genome.exons.intervals
+genome.genes.intervals
+genome.intergenic.intervals
+genome.rRNA.intervals
+STAR_INDEX/SA_read_length/
+```
+
+Note: The STAR index will be built based on the read length of the mRNA read (Read2).
+If there are different read lengths, it will produce multiple indexes.
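As a concrete illustration of the custom-reference layout described above, the two files could be put in place like this (a sketch; `my_genome.fa` and `my_annotation.gtf` are hypothetical local files, and the target directory follows the `funky_species_name` example):

```
# sketch: provide a custom reference so dropSeqPipe skips the download
mkdir -p /absolute/path/to/references/funky_species_name_A_1
cp my_genome.fa /absolute/path/to/references/funky_species_name_A_1/genome.fa
cp my_annotation.gtf /absolute/path/to/references/funky_species_name_A_1/annotation.gtf
```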
diff --git a/docs/docs/index.md b/docs/docs/index.md
index 02c2bd8..ccd190c 100644
--- a/docs/docs/index.md
+++ b/docs/docs/index.md
@@ -1,6 +1,18 @@
 Welcome
------------------------------
-Welcome to the documentation of dropSeqPipe v`0.4`.
+Welcome to the documentation of dropSeqPipe v`0.5`.
-This documentation is still under construction. \ No newline at end of file
+DropSeqPipe (dSP) is a pipeline for the analysis of single-cell data.
+The pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/) and the dropseq tools provided by the [McCarroll Lab](http://mccarrolllab.com/dropseq/). It allows going from the raw data of a single-cell experiment to the final count matrix, including extensive QC readouts.
+dSP provides an easy framework to reproduce and compare different experiments with different parameters, using STAR as the aligner. It is usable for any single-cell protocol using paired-end reads where the first read holds the cell and UMI barcodes and the second read the transcript.
+Below is a non-exhaustive list of compatible protocols and brands:
+
+* Drop-Seq
+* SCRB-Seq
+* 10x Genomics
+* DroNc-seq
+* Dolomite Bio ([Nadia Instrument](https://www.dolomite-bio.com/product/nadia-instrument/))
+
+This package tries to be as user-friendly as possible. Even though it requires command-line execution, the hope is that non-bioinformaticians will be able to use it.
+Note: A short demonstration on how to install and use dSP can be found in [this webinar by Sebastian Mueller](https://www.youtube.com/watch?v=4bt-azBO-18).
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index a4ff1ca..1db7388 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -4,15 +4,13 @@ repo_name: 'GitHub'
 repo_url: https://github.com/Hoohm/dropSeqPipe
 nav:
 - 'index.md'
- - 'Installation.md'
- - 'Reference-Files.md'
- - 'Create-config-files.md'
+ - 'Setting-up-an-experiment.md'
 - 'Running-dropSeqPipe.md'
- - 'Clusters.md'
 - 'Plots.md'
+ - 'Output_misc.md'
 - 'CHANGELOG.md'
 - 'FAQ.md'
 google_analytics:
 - 'UA-128943644-1'
- - 'auto' \ No newline at end of file
+ - 'auto'
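The documentation pages changed above are built with MkDocs from the `docs/` folder (see `docs/mkdocs.yml`). Assuming MkDocs is installed locally (not part of this changeset), the site can be previewed with:

```
# sketch: preview the documentation site locally
cd docs
mkdocs serve
```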