docs: update and amend simulation stage output file explanations #218

Open
wants to merge 2 commits into
base: master
38 changes: 27 additions & 11 deletions README.md
@@ -646,23 +646,39 @@ __Example runs:__

### 2. Simulation stage

1. `simulated_reads.fasta`
FASTA file of simulated reads. Each read has "unaligned", "aligned", or "perfect" in its header, indicating its error rate. "unaligned" means that the read has an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as the training reads. "perfect" reads have no errors.

#### read files

Two FASTA files of simulated reads are usually produced, or FASTQ files if the `--fastq` option is set:

1. `simulated_aligned_reads.fast(a|q)`
2. `simulated_unaligned_reads.fast(a|q)` (this file is not generated if you request error-free reads with `--perfect`)

For `metagenome` mode simulations, these two files are produced for each simulated sample, with samples named systematically: `simulated_sample0_aligned_reads.fast(a|q)`, `simulated_sample1_aligned_reads.fast(a|q)`, ...
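
As a rough illustration of the naming scheme described above, the following sketch lists the read files one might expect for a given run. The function `expected_read_files` and its parameters are hypothetical helpers, not part of the tool. The `simulated` prefix is taken from the examples above, and the per-sample name of the unaligned file is assumed by analogy with the aligned one.

```python
from typing import List, Optional


def expected_read_files(
    n_samples: Optional[int] = None,  # None: not a metagenome run (assumption)
    fastq: bool = False,              # mirrors the --fastq option
    perfect: bool = False,            # mirrors the --perfect option
    prefix: str = "simulated",        # prefix as used in the examples above
) -> List[str]:
    """Return the read file names described in this section (hypothetical helper)."""
    ext = "fastq" if fastq else "fasta"
    # With --perfect, the unaligned file is not generated.
    kinds = ["aligned"] if perfect else ["aligned", "unaligned"]
    if n_samples is None:
        stems = [prefix]
    else:
        # Assumed pattern: <prefix>_sample<i>_<kind>_reads.<ext>
        stems = [f"{prefix}_sample{i}" for i in range(n_samples)]
    return [f"{stem}_{kind}_reads.{ext}" for stem in stems for kind in kinds]


if __name__ == "__main__":
    for name in expected_read_files(n_samples=2, fastq=True):
        print(name)
```

A list like this can be handy when checking that a simulation run finished and wrote every expected file.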

In these files, each read has `unaligned`, `aligned`, or `perfect` in its header, indicating its error rate:
* `unaligned` reads have an error rate over 90% and cannot be aligned.
* `aligned` reads have the same error rate as the training reads.
* `perfect` reads have no errors.

To explain the information in the header, here are two examples:
* `>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0`
Everything before the first `_` is chromosome information. `468529` is the start position, and `unaligned` indicates that the read should not align to the reference. The first `0` is the sequence index. `F` represents the forward strand. `0_3236_0` means that the sequence extracted from the reference is 3236 bases long.
* `>ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2`
This is an aligned read coming from chromosome XI at position 115406. `16565` is the sequence index. `R` represents the reverse complement strand. `92_12710_2` means that this read has a 92-base head region (which cannot be aligned), followed by a 12710-base middle region, and then a 2-base tail region.
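
The header layout in the two examples above can be unpacked with a few lines of Python. This is a minimal sketch, not code from the tool: `ReadHeader` and `parse_read_header` are hypothetical names, and the parser assumes the plain eight-field layout shown in the examples (it does not handle retained-intron or chimeric headers). It splits from the right so that underscores inside the chromosome information, if any, do not shift the fields.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    reference: str   # chromosome information
    start: int       # start position on the reference
    read_type: str   # "unaligned", "aligned", or "perfect"
    index: int       # sequence index
    strand: str      # "F" (forward) or "R" (reverse complement)
    head: int        # length of the head region
    middle: int      # length of the middle region
    tail: int        # length of the tail region


def parse_read_header(header: str) -> ReadHeader:
    """Parse a simple (non-chimeric, no retained intron) read header."""
    # The last seven underscore-separated fields are fixed; the rest is the reference.
    fields = header.strip().lstrip(">").rsplit("_", 7)
    if len(fields) != 8:
        raise ValueError(f"unexpected header layout: {header!r}")
    ref, start, read_type, index, strand, head, middle, tail = fields
    return ReadHeader(ref, int(start), read_type, int(index), strand,
                      int(head), int(middle), int(tail))


if __name__ == "__main__":
    h = parse_read_header(
        ">ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2"
    )
    print(h.reference, h.start, h.read_type, h.strand, h.head + h.middle + h.tail)
```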

The information in the header can help users to locate the read easily.

__Specific to transcriptome simulation__: for reads that include retained introns, the header contains this information starting from `Retained_intron`, with each genomic interval separated by `;`.

__Specific to chimeric reads simulation__: for chimeric reads, the different source chromosomes and locations are separated by `;`, and the header contains `chimeric` to mark the read as chimeric.

#### error profile file

2. `simulated_error_profile`
This file contains all the information about the errors introduced into each read, including error type, position, original bases, and current bases:

3. `simulated_aligned_error_profile`

For `metagenome` mode simulations, this file is produced for each simulated sample, with samples named systematically: `simulated_sample0_error_profile`, `simulated_sample1_error_profile`, ...


## Acknowledgements