-
Notifications
You must be signed in to change notification settings - Fork 1
Mapped reads
When it comes to mapped reads, there is a single standard file format that is used: SAM (or its more efficient binary brother BAM). However, it has been designed mainly with genome resequencing in mind, and for some applications it is a not very convenient way of working with reads. In Gedi, we have an alternative way of storing and indexing interval-based data (which also includes things like genes, transcripts, transcription factor binding sites, ...): CIT (centered interval trees). Its main advantages are:
- It is based on principled data structures designed for efficient access to interval data (i.e. interval trees)
- We use a variant that makes it particularly useful for non-memory based indices (i.e. it's centered)
- We can store any payload data (e.g. mapped reads with or without information on mismatches, transcripts names and contained ORFs,...)
- Data can be compressed
- For read data, a single record contains all reads or read pairs that are mapped to a specific position in the genome (which is a very space efficient way of storing them)
- For read data, a single CIT file can contain the information of several samples (e.g. replicates, conditions)
Of course all tools built on top of Gedi can utilize either CIT files or BAM files.
Most Gedi programs have a parameter -reads
. This parameter either accepts a single BAM file, a single CIT file (containing mapped reads), or a bamlist
file (which contains, in each line, the name of or path to a bam file; lines starting with # are ignored; the extensions must be .bamlist
).
Example (test.bamlist
):
# this is a bamlist file containing three replicates of two conditions
C1.A.bam
C1.B.bam
C1.C.bam
C2.A.bam
C2.B.bam
C2.C.bam
Many of our tools run much faster based on cit files than on several bam files. To convert several bamfiles to a single cit file, create a bamlist file (see above), and call the bamlist2cit bash script like this:
bamlist2cit xyz.bamlist
As CIT is a proprietary format of Gedi, it comes with a little tool to extract data from such files. Its full usage is
gedi -e ViewCIT <Options> <file>
Options:
-l Only list the number of elements per reference
-q <location> Only output elements overlapping the given query (i.e. whole reference or genomic region)
-o <file> Specify output file (Default: stdout)
-m <mode> output mode: Bed/Location/Cit (Default: location)
-name <js> javascript function body to generate name (variable d is current data, ref the reference, reg the region)
-score <js> javascript function body to generate score (variable d is current data, ref the reference, reg the region)
-p Show progress
-h Show this message
-D Output debugging information
Examples:
gedi -e ViewCIT -l hek293.cit
List the number of entries per chromosome for the given file (containing mapped reads). The number of entries is not the number of reads, but the number of distinct genomic regions (defined by start and end position of a mapped reads and the positions of potential introns). The lesson here is that an entry may correspond to more than one read, either because the same sequence has been observed multiple times, or distinct sequences map to the same genomic region due to mismatches.
Output:
hek293.cit: gedi.core.data.reads.DefaultAlignedReadsData
1+ 368315
1- 387836
2+ 212989
2- 202875
3+ 175598
3- 172153
4+ 110494
4- 102068
5+ 204240
...
gedi -e ViewCIT -q 1+:27322311-27322411 -m bed -name 'd.getId(0)+""' hek293.cit
Extract all reads mapped to the watson strand of chromosome 1, overlapping the region 27322311-27322411 in bed format. For the name column in the bed file, output the read id from the first of potentially many distinct sequences mapping at that location. The -name
parameter can refer to any Javascript code executed using the Nashorn JS engine. It has full access to the Gedi library and can refer to the variables ref
(the ReferenceSequence object), reg
(the GenomicRegion object) and d
(the data object, in this case an AlignedReadData object). Obviously, this is a very powerful mechanism but requires deep knowledge of the internals of Gedi.
Output:
1+:27322293-27322315 132480
1+:27322345-27322367 132490
1+:27322329-27322347 132487
1+:27322320-27322346 132484
1+:27322290-27322312 132477
1+:27322359-27322382 132494
1+:27322380-27322403 132507
1+:27322364-27322386 132500
1+:27322288-27322316 132476
1+:27322353-27322380 132493
1+:27322328-27322353 132486
1+:27322291-27322315 132479
1+:27322365-27322391|27330719-27330720 132504
...
gedi -e ViewCIT -q 1+:27322311-27322411 price/hek293.merged.codons.cit
Extract inferred codons with their activity values in the locations format. Output:
1+:27322340-27322343 [4.0]
1+:27322392-27322395 [1.0]
1+:27322347-27322350 [1.0]
1+:27322326-27322329 [5.0]
1+:27322377-27322380 [6.956845]
1+:27322332-27322335 [3.0]
1+:27322386-27322389 [2.0]
1+:27322365-27322368 [1.0]
1+:27322317-27322320 [1.0]
1+:27322374-27322377 [13.043155]
1+:27322354-27322357 [1.204386]
1+:27322359-27322362 [2.795614]
gedi -e ViewCIT -m bed -q '8+:27491575-27528868' -name 'd.getType()' -score 'd.getTotalActivity()' price/hsv.orfs.cit
Extract inferred ORFs in bed format with the ORF type in the name column and the total activity of the first (and, here, only) experiment in the score column. This might give a further impression about how powerful the Javascript access is. Output:
8 27491717 27528868 CDS 978 + 0 0 0 6 7,99,120,99,1044,452, 0,15501,17307,22581,24295,36699,
8 27491575 27491617 uORF 20 +
gedi -e ViewCIT -q 1+:27322311-27322411 -name 'd.getTranscriptId()' Homo_sapiens.GRCh38.86.gtf.index.cit
Extract transcripts from the index generated by IndexGenome and output them in location format. Output:
1+:27322159-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334281|27335378-27336400 ENST00000374076
1+:27322253-27322391|27330719-27330804|27330984-27331137|27332069-27332101|27333957-27334054|27334150-27334281|27334666-27334754|27335378-27335648 ENST00000486082
1+:27322280-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334896 ENST00000464813
1+:27322253-27322391|27330719-27330804|27332069-27332101|27332446-27332559 ENST00000466759
1+:27322299-27322391|27330719-27330804|27332069-27332101|27333360-27333435|27333957-27334054|27334150-27334281|27335378-27335545 ENST00000498220
1+:27322246-27322391|27330719-27330804|27330984-27331137|27332069-27332101|27333957-27334054|27334150-27334281|27335378-27336400 ENST00000464720
1+:27322247-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334281|27335378-27336400 ENST00000608611
1+:27322284-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334281|27334666-27334754|27335378-27335824 ENST00000478104
1+:27322144-27322391|27330719-27330804|27332069-27332101|27333957-27334054|27334150-27334281|27335378-27336399 ENST00000611517
To produce CIT files containing read mappings, one or several BAM files can be merged by
gedi -e Bam2CIT -p mapped.cit exp1.bam exp2.bam exp3.bam