Version 2.5.0
armintoepfer committed Feb 23, 2022
1 parent ddeb080 commit 7ec7289
Showing 13 changed files with 145 additions and 32 deletions.
12 changes: 11 additions & 1 deletion docs/changelog.md
@@ -6,7 +6,17 @@ nav_order: 99

# Version changelog

* **2.4.0**:
* **2.5.0**:
* Upcoming SMRT Link release
* Add [`lima-undo` functionality](/faq/undo)
* Support methylation tag clipping
* Add progress and ETA for `--log-level INFO`
* Rename `--preset` to [`--hifi-preset`](/faq/hifi-presets)
* Add barcoded adapter `--hifi-preset SYMMETRIC-ADAPTERS`
* Fixes to support stranded HiFi BAM input
* Do not abort on empty input, but warn only

* 2.4.0:
* Fix fasta/q input and `--guess`
* Output empty files for missing barcode pairs `--output-missing-pairs`
* Output each barcode into its own sub-directory `--split-subdirs`
25 changes: 11 additions & 14 deletions docs/faq/Speed.md
@@ -5,22 +5,19 @@ title: Speed
---

## How fast is fast?
Example: 200 barcodes, asymmetric mode (try each barcode forward and
reverse-complement), 300,000 CCS reads. On my 2014 iMac with 4 cores + HT:
Example: 64 barcodes, asymmetric mode, 1.9M HiFi reads on a dual 64-core EPYC system:

503.57s user 11.74s system 725% cpu 1:11.01 total
Processed : 1912155
Throughput: 2393135/min
Run Time : 48s 306ms
CPU Time : 2h 14m

Those 1:11 minutes translate into 0.233 milliseconds per ZMW,
1.16 microseconds per barcode for both sides aligning forward and reverse-complement,
and 291 nanoseconds per alignment. This includes IO.
That's 2.4M HiFi reads processed per minute on 128 physical CPU cores, including
IO.

## Why doesn't *lima* utilize the maximum number of provided cores?
This might be a simple IO bottleneck. With a barcode.fasta containing only a few
barcodes, most of the time is spent reading and writing BAM files, as the barcode
identification is too fast. Starting version 2.2.0, you can enable multi-threaded
BAM reading by setting the number of threads via an environment variable
## Is there a way to show the progress?
Yes, use `--log-level INFO`. If a `.pbi` file is present, progress and the
estimated time remaining are shown. Otherwise, the number of processed reads is
reported every 5 seconds.

export PB_BAMREADER_THREADS=2

## Is there a way to show the progress?
No. Please run `wc -l prefix.report` to get the number of already processed ZMWs.
20 changes: 20 additions & 0 deletions docs/faq/barcoded-adapter.md
@@ -0,0 +1,20 @@
---
layout: default
parent: FAQ
title: Barcoded Adapter
---

## Barcoded Adapter
The most convenient way to barcode a sample is the use of barcoded adapters, as
depicted in the [barcode design overview](barcode-design). One minor
disadvantage is that the ligation might not be as efficient as with standard
SMRTbell adapters, leaving some molecules with only one adapter. As barcoded
adapter designs are inherently symmetric, we implemented ways to recover the
demultiplexed yield from one-sided barcoded molecules with ease.

As the first step, generate HiFi data with *ccs* v6.3.0 or later. This version
stores [additional tags per
record](https://ccs.how/faq/missing-adapters.html), indicating whether the molecule
has missing adapters on either side. The second step is to use the new
`--hifi-preset SYMMETRIC-ADAPTERS` introduced with *lima* v2.5.0, [described
here](/faq/hifi-presets). That's it.
15 changes: 15 additions & 0 deletions docs/faq/biosample.md
@@ -22,3 +22,18 @@ relevant. Example:
Provide this CSV to lima via `--biosample-csv input.csv`.

This will associate the bio sample name to the read group using the `SM` tag.

## UUID passthrough
Since *lima* v2.5.0, you can specify the UUIDs of the resulting XML files. To do
so, pass `--reuse-uuids` and add a `UUID` column to the CSV given to
`--biosample-csv`. Example:

Barcodes,UUID,Bio Sample
bc1001--bc1001,11111111-1111-1aaa-0111-111111111111,Alfred
bc1002--bc1002,22222222-2222-2bbb-8222-222222222222,Berthold
bc1003--bc1003,33333333-3333-3ccc-9222-333333333333,Constantin
bc1008--bc1008,e04f12c9-7b2e-45fd-ab49-1bc2f75d653a,Holger

Ensure that each UUID matches the regex

[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[089ab][0-9a-f]{3}-[0-9a-f]{12}
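
A minimal sketch for checking the UUID column before handing the CSV to *lima*, assuming the three-column layout shown above and a POSIX shell with `grep -E`. The helper names `valid_uuid` and `check_csv` are made up for this illustration and are not part of *lima*.

```shell
# Hypothetical helper, not part of lima: succeeds if $1 matches the UUID regex above.
valid_uuid() {
    printf '%s\n' "$1" | grep -Eqx '[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[089ab][0-9a-f]{3}-[0-9a-f]{12}'
}

# Hypothetical helper: check the UUID column (second field) of a biosample CSV,
# skipping the header line, and report every row whose UUID does not match.
check_csv() {
    tail -n +2 "$1" | while IFS=, read -r barcodes uuid sample; do
        valid_uuid "$uuid" || echo "invalid UUID for $barcodes: $uuid"
    done
}
```

Calling `check_csv input.csv` prints one line per offending row and nothing if every UUID matches. Note that the regex only accepts lowercase hex digits, so uppercase UUIDs are reported as invalid.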
22 changes: 22 additions & 0 deletions docs/faq/hifi-presets.md
@@ -0,0 +1,22 @@
---
layout: default
parent: FAQ
title: HiFi Presets
---

## HiFi presets
With v2.5.0, we introduced recommended parameter presets, selected via
`--hifi-preset`. All presets use

--ccs --min-score 80 --min-end-score 50 --min-ref-span 0.75

In addition, they differ as follows:

| Preset | Definition |
| -------------------- | ------------------------------------- |
| `SYMMETRIC` | `--same` |
| `SYMMETRIC-ADAPTERS` | `--same --ignore-missing-adapters` |
| `ASYMMETRIC` | `--different --min-scoring-regions 2` |

For barcoded adapter libraries, `SYMMETRIC-ADAPTERS` will increase demultiplexed
yield. More information is available in the [barcoded adapter FAQ](/faq/barcoded-adapter).
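
To make the table concrete, the sketch below spells out the flags each preset stands for, using only the definitions listed above. The function name `expand_hifi_preset` is hypothetical; *lima* applies the preset internally, so this is only useful for reading or logging the equivalent explicit command line.

```shell
# Hypothetical helper, not part of lima: print the explicit flags that a
# --hifi-preset value stands for, according to the table above.
expand_hifi_preset() {
    common="--ccs --min-score 80 --min-end-score 50 --min-ref-span 0.75"
    case "$1" in
        SYMMETRIC)          echo "$common --same" ;;
        SYMMETRIC-ADAPTERS) echo "$common --same --ignore-missing-adapters" ;;
        ASYMMETRIC)         echo "$common --different --min-scoring-regions 2" ;;
        *)                  echo "unknown preset: $1" >&2; return 1 ;;
    esac
}
```

For example, `expand_hifi_preset ASYMMETRIC` prints `--ccs --min-score 80 --min-end-score 50 --min-ref-span 0.75 --different --min-scoring-regions 2`.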
7 changes: 2 additions & 5 deletions docs/faq/how-to-run.md
@@ -20,18 +20,15 @@ Run on CCS / HiFi data:
$ lima <movie>.ccs.bam <barcodes>.fasta <demux>.bam
$ lima <movie>.consensusreadset.xml <barcodes>.barcodeset.xml <demux>.consensusreadset.xml

If you do not need to import the demultiplexed data into SMRT Link, it is advised
to use `--no-pbi`, omit the pbi index file, to minimize time to result.

### *Symmetric* or *Tailed* options

CLR: --same
CCS: --same --ccs
CCS: --hifi-preset SYMMETRIC

### *Asymmetric* options

CLR: --different
CCS: --different --ccs
CCS: --hifi-preset ASYMMETRIC

### Example execution

3 changes: 2 additions & 1 deletion docs/faq/primer.md
@@ -5,4 +5,5 @@ title: Primer removal
---

## Can I remove PCR primers after demultiplexing?
Yes! After demultiplexing, just lima on the output again with your PCR primer(s).
Yes! After demultiplexing, just call *lima* on the output again with your PCR
primer(s).
10 changes: 9 additions & 1 deletion docs/faq/split-output.md
@@ -9,7 +9,7 @@ You can either iterate over the `prefix.bam` file N times or use
`--split-bam`. Each barcode has its own BAM file called
`prefix.idxBest--idxCombined.bam`, e.g., `prefix.0--0.bam`.

The optional parameter `--split-bam-named`, names the files by their barcode names instead
The optional parameter `--split-named` names the files by their barcode names instead
of their barcode indices. Non-word characters, anything except [A-Za-z0-9_],
in barcode names are replaced with an underscore in the file name.
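
The replacement rule can be mimicked in the shell. This is a sketch of the rule as described here, not *lima*'s actual code; `sanitize_name` is a made-up helper and the barcode name in the usage note is invented.

```shell
# Hypothetical helper, not part of lima: replace every character outside
# [A-Za-z0-9_] with an underscore, as described for barcode names above.
sanitize_name() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^A-Za-z0-9_]/_/g'
}
```

For example, `sanitize_name 'my barcode#1'` prints `my_barcode_1`.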

@@ -26,3 +26,11 @@ sequence is barcode `0` and the last barcode `numBarcodes - 1`.
If you use output BAM splitting, it can happen that you get a lot of output files.
Using `--files-per-directory N` creates subdirectories and outputs at most `N`
barcodes per directory.

## Split barcodes into own sub-directories
Since v2.5.0, each barcode can be stored in its own sub-directory via `--split-subdirs`.
A parent XML will point to all of the barcoded files.

## Output missing barcodes
If you have provided bio samples with barcode pairs, the option `--output-missing-pairs`
creates empty barcode files for missing pairs in all split modes.
43 changes: 43 additions & 0 deletions docs/faq/undo.md
@@ -0,0 +1,43 @@
---
layout: default
parent: FAQ
title: Undo
---

## Undo demultiplexing
With the introduction of *lima* v2.5.0, it is possible to undo all
demultiplexing steps for **HiFi data**. For this, the bioconda package contains a
new `lima-undo` binary.

Example:

lima movie.hifi_reads.bam barcodes.fasta demux.consensusreadset.xml --hifi-preset SYMMETRIC --store-unbarcoded
lima-undo demux.consensusreadset.xml undo.bam

Let's unroll what's happening. In the first line, we explicitly request to store
the unbarcoded reads. Without this, we would not be able to recover unbarcoded
reads. The `XML` contains all the file paths to the `BAM` files. The second call is
to the new *lima-undo* binary that takes an `XML` or `BAM` file as input and
output.

Optionally, you can also provide multiple input `BAM` files with one output `BAM`:

lima-undo demux.bam demux.unbarcoded.bam undo.bam

This also works with split BAM files:

lima-undo demux.bc1001--bc1001.bam demux.bc1002--bc1002.bam demux.unbarcoded.bam undo.bam

## How does it work?
*lima* v2.5.0 and later stores everything that got clipped in an internal binary
structure in the `ls` tag. Multiple demultiplexing rounds are supported. Once
*lima-undo* gets called, for each read the individual demultiplexing steps get
reverted until the read is identical to the original HiFi read.

## How can I check if undo results are correct?
Sort the undone reads by ZMW and diff them against the original HiFi reads:

samtools sort --no-PG -t "zm" undo.bam -o sorted.undo.bam
samtools view --no-PG sorted.undo.bam > undo.sam
samtools view --no-PG movie.hifi_reads.bam > original.sam
diff original.sam undo.sam
4 changes: 2 additions & 2 deletions docs/get-started.md
@@ -73,11 +73,11 @@ For CCS / HiFi data, use following compatibility matrix:

HiFi run from *BAM* with **symmetric** barcodes:

lima <movie>.hifi_reads.bam barcodes.fasta <movie>.demux.bam --same --ccs --min-score 80
lima <movie>.hifi_reads.bam barcodes.fasta <movie>.demux.bam --hifi-preset SYMMETRIC

HiFi run from *FASTQ* with **asymmetric** barcodes:

lima <movie>.hifi_reads.fq.gz barcodes.fasta <movie>.demux.fastq --different --ccs --min-score 80
lima <movie>.hifi_reads.fq.gz barcodes.fasta <movie>.demux.fastq --hifi-preset ASYMMETRIC

CLR run from *XML* with **symmetric** barcodes:

Binary file added docs/img/lima_card_2022.png
10 changes: 5 additions & 5 deletions docs/index.md
@@ -7,7 +7,7 @@ permalink: /
---

<p align="center">
<img src="img/lima_card.png" alt="lima logo" width="650px"/>
<img src="img/lima_card_2022.png" alt="lima logo" width="650px"/>
</p>

***
@@ -23,11 +23,11 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
for information on Installation, Support, License, Copyright, and Disclaimer.

## Latest Version
Version **2.4.0**: [Full changelog here](/changelog)
Version **2.5.0**: [Full changelog here](/changelog)

## What's new!
New documentation is up, a 1:1 port from the original GitHub docs with minor
enhancements. Expect major enhancements in upcoming releases.
## What's new
* Recommended parameters via [`--hifi-preset`](/faq/hifi-presets)
* Undo demultiplexing via [`lima-undo`](/faq/undo)

## Get started
If you are new to demultiplexing barcoded samples, check out the [Get Started guide](/get-started).
6 changes: 3 additions & 3 deletions docs/output/removed.md
@@ -1,9 +1,9 @@
---
layout: default
parent: Output files
title: Removed
title: Unbarcoded
---

## Removed records
Using the option `--dump-removed`, records that did not pass provided thresholds
## Unbarcoded records
Using the option `--store-unbarcoded`, records that did not pass the provided thresholds
or are without barcodes are stored in the file `prefix.unbarcoded.bam`.
