diff --git a/docs/changelog.md b/docs/changelog.md index 3ad2293..7af4a44 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -6,7 +6,17 @@ nav_order: 99 # Version changelog - * **2.4.0**: + * **2.5.0**: + * Upcoming SMRT Link release + * Add [`lima-undo` functionality](/faq/undo) + * Support methylation tag clipping + * Add progress and ETA for `--log-level INFO` + * Rename `--preset` to [`--hifi-preset`](/faq/hifi-presets) + * Add barcoded adapter `--hifi-preset SYMMETRIC-ADAPTERS` + * Fixes to support stranded HiFi BAM input + * Do not abort on empty input, but warn only + + * 2.4.0: * Fix fasta/q input and `--guess` * Output empty files for missing barcode pairs `--output-missing-pairs` * Output each barcode into its own sub-directory `--split-subdirs` diff --git a/docs/faq/Speed.md b/docs/faq/Speed.md index 26d4fcd..b881dc7 100644 --- a/docs/faq/Speed.md +++ b/docs/faq/Speed.md @@ -5,22 +5,19 @@ title: Speed --- ## How fast is fast? -Example: 200 barcodes, asymmetric mode (try each barcode forward and -reverse-complement), 300,000 CCS reads. On my 2014 iMac with 4 cores + HT: +Example: 64 barcodes / asymmetric mode / 1.9M HiFi reads on a dual 64c EPYC system: - 503.57s user 11.74s system 725% cpu 1:11.01 total + Processed : 1912155 + Throughput: 2393135/min + Run Time : 48s 306ms + CPU Time : 2h 14m -Those 1:11 minutes translate into 0.233 milliseconds per ZMW, -1.16 microseconds per barcode for both sides aligning forward and reverse-complement, -and 291 nanoseconds per alignment. This includes IO. +That's 2.4M HiFi reads processed per minute on 128 physical CPU cores, including +IO. -## Why doesn't *lima* utilize the maximum number of provided cores? -This might be a simple IO bottleneck. With a barcode.fasta containing only a few -barcodes, most of the time is spent reading and writing BAM files, as the barcode -identification is too fast. 
Starting version 2.2.0, you can enable multi-threaded -BAM reading by setting the number of threads via an environment variable +## Is there a way to show the progress? +Yes, please use `--log-level INFO`. If there is a `.pbi` file present, the +estimated time will be shown. Otherwise, it will show progress as the number of +reads every 5 seconds. - export PB_BAMREADER_THREADS=2 -## Is there a way to show the progress? -No. Please run `wc -l prefix.report` to get the number of already processed ZMWs. diff --git a/docs/faq/barcoded-adapter.md b/docs/faq/barcoded-adapter.md new file mode 100644 index 0000000..6f0770e --- /dev/null +++ b/docs/faq/barcoded-adapter.md @@ -0,0 +1,20 @@ +--- +layout: default +parent: FAQ +title: Barcoded Adapter +--- + +## Barcoded Adapter +The most convenient way to barcode a sample is the use of barcoded adapters, as +depicted in the [barcode design overview](barcode-design). One minor +disadvantage is that the ligation might not be as efficient as with standard +SMRTbell adapters, leaving some molecules with only one adapter. As barcoded +adapter designs are inherently symmetric, we implemented ways to recover the +demultiplexed yield from one-sided barcoded molecules with ease. + +As the first step, generate HiFi data with *ccs* v6.3.0 or later. This version +will store [additional tags per +record](https://ccs.how/faq/missing-adapters.html), indicating if the molecule +has missing adapters on either side. The second step is to use the new +`--hifi-preset SYMMETRIC-ADAPTERS` introduced with *lima* v2.5.0, [described +here](/faq/hifi-presets). That's it. diff --git a/docs/faq/biosample.md b/docs/faq/biosample.md index 297cec5..f57c9e1 100644 --- a/docs/faq/biosample.md +++ b/docs/faq/biosample.md @@ -22,3 +22,18 @@ relevant. Example: Provide this CSV to lima via `--biosample-csv input.csv`. This will associate the bio sample name to the read group using the `SM` tag. 
+ +## UUID passthrough +Since *lima* v2.5.0, the functionality has been enhanced to allow specifying +UUIDs for the resulting XML files; for this, use `--reuse-uuids` in addition to +the extended CSV for `--biosample-csv`. Example: + + Barcodes,UUID,Bio Sample + bc1001--bc1001,11111111-1111-1aaa-0111-111111111111,Alfred + bc1002--bc1002,22222222-2222-2bbb-8222-222222222222,Berthold + bc1003--bc1003,33333333-3333-3ccc-9222-333333333333,Constantin + bc1008--bc1008,e04f12c9-7b2e-45fd-ab49-1bc2f75d653a,Holger + +Ensure that the UUID matches the regex + + [0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[089ab][0-9a-f]{3}-[0-9a-f]{12} diff --git a/docs/faq/hifi-presets.md b/docs/faq/hifi-presets.md new file mode 100644 index 0000000..85980b6 --- /dev/null +++ b/docs/faq/hifi-presets.md @@ -0,0 +1,22 @@ +--- +layout: default +parent: FAQ +title: HiFi Presets +--- + +## HiFi presets +With v2.5.0 we introduced the concept of recommended parameter presets called +`--hifi-preset`. All presets use + + --ccs --min-score 80 --min-end-score 50 --min-ref-span 0.75 + +In addition, they differ as follows: + +| Preset | Definition | +| -------------------- | ------------------------------------- | +| `SYMMETRIC` | `--same` | +| `SYMMETRIC-ADAPTERS` | `--same --ignore-missing-adapters` | +| `ASYMMETRIC` | `--different --min-scoring-regions 2` | + +For barcoded adapter libraries, `SYMMETRIC-ADAPTERS` will increase demultiplexed +yield. More info under [barcoded adapter FAQ](/faq/barcoded-adapter). diff --git a/docs/faq/how-to-run.md b/docs/faq/how-to-run.md index 1db228b..0922352 100644 --- a/docs/faq/how-to-run.md +++ b/docs/faq/how-to-run.md @@ -20,18 +20,15 @@ Run on CCS / HiFi data: $ lima .ccs.bam .fasta .bam $ lima .consensusreadset.xml .barcodeset.xml .consensusreadset.xml -If you do not need to import the demultiplexed data into SMRT Link, it is advised -to use `--no-pbi`, omit the pbi index file, to minimize time to result. 
- ### *Symmetric* or *Tailed* options CLR: --same - CCS: --same --ccs + CCS: --hifi-preset SYMMETRIC ### *Asymmetric* options CLR: --different - CCS: --different --ccs + CCS: --hifi-preset ASYMMETRIC ### Example execution diff --git a/docs/faq/primer.md b/docs/faq/primer.md index 273e3a4..85f438b 100644 --- a/docs/faq/primer.md +++ b/docs/faq/primer.md @@ -5,4 +5,5 @@ title: Primer removal --- ## Can I remove PCR primers after demultiplexing? -Yes! After demultiplexing, just lima on the output again with your PCR primer(s). +Yes! After demultiplexing, just call *lima* on the output again with your PCR +primer(s). diff --git a/docs/faq/split-output.md b/docs/faq/split-output.md index 94831d6..55d4389 100644 --- a/docs/faq/split-output.md +++ b/docs/faq/split-output.md @@ -9,7 +9,7 @@ You can either iterate over the `prefix.bam` file N times or use `--split-bam`. Each barcode has its own BAM file called `prefix.idxBest--idxCombined.bam`, e.g., `prefix.0--0.bam`. -The optional parameter `--split-bam-named`, names the files by their barcode names instead +The optional parameter `--split-named` names the files by their barcode names instead of their barcode indices. Non-word characters, anything except [A-Za-z0-9_], in barcode names are replaced with an underscore in the file name. @@ -26,3 +26,11 @@ sequence is barcode `0` and the last barcode `numBarcodes - 1`. If you use output BAM splitting, it can happen that you get a lot of output files. Using `--files-per-directory N` creates subdirectories and outputs at most `N` barcodes per directory. + +## Split barcodes into own sub-directories +Since v2.5.0, each barcode can be stored in its own sub-directory: `--split-subdirs`. +A parent XML will point to all of the barcoded files. + +## Output missing barcodes +If you have provided bio samples with barcode pairs, the option `--output-missing-pairs` +creates empty barcode files in all split modes. 
diff --git a/docs/faq/undo.md b/docs/faq/undo.md new file mode 100644 index 0000000..5ebecfe --- /dev/null +++ b/docs/faq/undo.md @@ -0,0 +1,43 @@ +--- +layout: default +parent: FAQ +title: Undo +--- + +## Undo demultiplexing +With the introduction of *lima* v2.5.0, it is possible to undo all +demultiplexing steps for **HiFi data**. For this, the bioconda package contains a +new `lima-undo` binary. + +Example: + + lima movie.hifi_reads.bam barcodes.fasta demux.consensusreadset.xml --hifi-preset SYMMETRIC --store-unbarcoded + lima-undo demux.consensusreadset.xml undo.bam + +Let's unroll what's happening. In the first line, we explicitly request to store +the unbarcoded reads. Without this, we would not be able to recover unbarcoded +reads. The `XML` contains all the file paths to the `BAM` files. The second call is +to the new *lima-undo* binary that takes an `XML` or `BAM` file as input and +output. + +Optionally, you can also provide multiple input `BAM` files with one output `BAM`: + + lima-undo demux.bam demux.unbarcoded.bam undo.bam + +This also works with split BAM files: + + lima-undo demux.bc1001-bc1001.bam demux.bc1002-bc1002.bam demux.unbarcoded.bam undo.bam + +## How does it work? +*lima* v2.5.0 and later stores everything that got clipped in an internal binary +structure in the `ls` tag. Multiple demultiplexing rounds are supported. Once +*lima-undo* gets called, for each read the individual demultiplexing steps get +reverted until the read is identical to the original HiFi read. + +## How can I check if undo results are correct? 
+How to check that the result is identical: + + samtools sort --no-PG -t "zm" undo.bam -o sorted.undo.bam + samtools view --no-PG sorted.undo.bam > undo.sam + samtools view --no-PG movie.hifi_reads.bam > original.sam + diff original.sam undo.sam diff --git a/docs/get-started.md b/docs/get-started.md index a569f78..61759c9 100644 --- a/docs/get-started.md +++ b/docs/get-started.md @@ -73,11 +73,11 @@ For CCS / HiFi data, use following compatibility matrix: HiFi run from *BAM* with **symmetric** barcodes: - lima .hifi_reads.bam barcodes.fasta .demux.bam --same --ccs --min-score 80 + lima .hifi_reads.bam barcodes.fasta .demux.bam --hifi-preset SYMMETRIC HiFi run from *FASTQ* with **asymmetric** barcodes: - lima .hifi_reads.fq.gz barcodes.fasta .demux.fastq --different --ccs --min-score 80 + lima .hifi_reads.fq.gz barcodes.fasta .demux.fastq --hifi-preset ASYMMETRIC CLR run from *XML* with **symmetric** barcodes: diff --git a/docs/img/lima_card_2022.png b/docs/img/lima_card_2022.png new file mode 100644 index 0000000..9d11908 Binary files /dev/null and b/docs/img/lima_card_2022.png differ diff --git a/docs/index.md b/docs/index.md index 9f93087..ff9b399 100644 --- a/docs/index.md +++ b/docs/index.md @@ -7,7 +7,7 @@ permalink: / ---

- lima logo + lima logo

*** @@ -23,11 +23,11 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie for information on Installation, Support, License, Copyright, and Disclaimer. ## Latest Version -Version **2.4.0**: [Full changelog here](/changelog) +Version **2.5.0**: [Full changelog here](/changelog) -## What's new! -New documentation is up, a 1:1 port from the original GitHub docs with minor -enhancements. Expect major enhancements in upcoming releases. +## What's new + * Recommended parameters via [`--hifi-preset`](/faq/hifi-presets) + * Undo demultiplexing via [`lima-undo`](/faq/undo) ## Get started If you are new to demultiplexing barcoded samples, check out the [Get Started guide](/get-started). diff --git a/docs/output/removed.md b/docs/output/removed.md index d9c1aad..6e8999d 100644 --- a/docs/output/removed.md +++ b/docs/output/removed.md @@ -1,9 +1,9 @@ --- layout: default parent: Output files -title: Removed +title: Unbarcoded --- -## Removed records -Using the option `--dump-removed`, records that did not pass provided thresholds +## Unbarcoded records +Using the option `--store-unbarcoded`, records that did not pass provided thresholds or are without barcodes, are stored in the file `prefix.removed.bam`.
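
Note on the UUID passthrough hunk in `docs/faq/biosample.md`: the regex given there can be checked against a CSV before handing it to `--biosample-csv`. The following is a minimal sketch, not part of *lima*; the helper name `invalid_uuid_rows` is hypothetical, and it assumes that the whole UUID field must match the pattern (a full match), which the diff does not state explicitly. The column names and sample rows are taken from the example CSV in that hunk.

```python
import csv
import io
import re

# Pattern copied verbatim from the biosample FAQ hunk (lowercase hex only).
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[089ab][0-9a-f]{3}-[0-9a-f]{12}"
)


def invalid_uuid_rows(csv_text):
    """Return (barcodes, uuid) pairs whose UUID field does not fully match.

    Hypothetical helper for illustration; assumes the header row
    "Barcodes,UUID,Bio Sample" shown in the documentation example.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["Barcodes"], row["UUID"])
        for row in reader
        if UUID_RE.fullmatch(row["UUID"]) is None
    ]


# Rows from the documentation example.
EXAMPLE_CSV = """\
Barcodes,UUID,Bio Sample
bc1001--bc1001,11111111-1111-1aaa-0111-111111111111,Alfred
bc1002--bc1002,22222222-2222-2bbb-8222-222222222222,Berthold
bc1003--bc1003,33333333-3333-3ccc-9222-333333333333,Constantin
bc1008--bc1008,e04f12c9-7b2e-45fd-ab49-1bc2f75d653a,Holger
"""

if __name__ == "__main__":
    for barcodes, uuid in invalid_uuid_rows(EXAMPLE_CSV):
        print(f"{barcodes}: UUID {uuid!r} does not match the documented regex")
```

Because the character classes only contain lowercase hex digits, an uppercase UUID is reported as invalid by this sketch; whether *lima* itself is equally strict is not stated in the diff.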