- Fixes linear grouped counts output #258.
- Fixed duplicated noninformative and intergenic read assignments. As a result, fixed duplicated novel transcripts #236.
- Fixes exon counting algorithm #229.
- Fixed YAML support in visualization #222.
- Fixed transcript naming when IsoQuant-generated GTF is provided as input #219.
- Fixed exons attribute duplication #219.
- Exon ids are now consistent between input and output annotations if present.
- New --count_format option for setting desired grouped counts format (matrix/linear/both), fixes #223.
- New visualization software developed by @jackfreeman88. See more here.
- Dramatically reduced RAM consumption for grouped counts, about 10-20x decrease on datasets with large number of groups. Important fix for single-cell data processing. Should fix #189.
- Fixed #195: output GTF contained very similar isoforms and estimated their expression as 0.
- New documentation is now available at ablab.github.io/IsoQuant.
- Dramatically reduced RAM consumption. Should fix #209.
IsoQuant 3.4.2 was tested on a simulated ONT dataset with 30M reads using 12 threads.
In the default mode RAM consumption decreased from 280GB to 12GB when using
the reference annotation and from 230GB down to 6GB in the reference-free mode.
Running time in the default mode increased by approximately 20-25%.
When the --high_memory option is used, running time remains the same as in 3.4.1,
while RAM consumption is 46GB in the reference-based mode and 36GB in the reference-free mode.
Note that, in general, RAM consumption depends on the particular data being used and the number of threads.
In brief, in 3.4.0 and 3.4.1 the excessive RAM consumption was caused by this commit.
Apparently, adding a couple of int fields to the BasicReadAssignment class caused the default pickle serialization
to not release used memory (possibly a leak). Since some large lists of BasicReadAssignment objects were sent between
processes, the main process consumed unnecessary RAM. When new processes were later created
for GTF construction, total RAM consumption exploded because of the way Python multiprocessing works.
This release implements two ways of fixing the issue: sending objects via disk (default) and
using custom pickle serialization (when --high_memory is used).
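The two strategies mentioned above can be illustrated with a minimal Python sketch. It is not IsoQuant's actual code: the class ReadAssignmentSketch, its field names, and the helper functions are hypothetical stand-ins chosen for illustration. The sketch shows (1) a class that controls its own pickle state via `__getstate__`/`__setstate__`, and (2) passing a list of objects through a temporary file instead of sending it between processes directly.

```python
import os
import pickle
import tempfile


class ReadAssignmentSketch:
    """Hypothetical stand-in for BasicReadAssignment (fields are illustrative)."""

    __slots__ = ("read_id", "gene_id", "start", "end")

    def __init__(self, read_id, gene_id, start, end):
        self.read_id = read_id
        self.gene_id = gene_id
        self.start = start
        self.end = end

    def __getstate__(self):
        # Custom serialization: pack the fields into a plain tuple.
        return (self.read_id, self.gene_id, self.start, self.end)

    def __setstate__(self, state):
        # Restore the fields from the tuple produced by __getstate__.
        self.read_id, self.gene_id, self.start, self.end = state


def dump_to_disk(assignments, path):
    """Serialize a list of assignments to a file."""
    with open(path, "wb") as handle:
        pickle.dump(assignments, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_from_disk(path):
    """Load a list of assignments previously written by dump_to_disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def round_trip_via_disk(assignments):
    """Write assignments to a temporary file, read them back, and clean up."""
    fd, path = tempfile.mkstemp(suffix=".pickle")
    os.close(fd)
    try:
        dump_to_disk(assignments, path)
        return load_from_disk(path)
    finally:
        os.remove(path)
```

In a multiprocessing setting, a worker would call something like dump_to_disk and return only the file path, so the parent process never receives the large list through the inter-process pipe.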
- Transcript and exon ids are now identical between runs, including ones with different number of threads.
- Fixed IndexError: list index out of range when --sqanti_output is set (#186).
- Fixed IndexError: list index out of range in printing grouped transcript models TPMs (#187).
- Reduced running time when --sqanti_output is set.
Major novelties and improvements:
- Significant speed-up on datasets containing regions with extremely high coverage, often encountered on mitochondrial chromosomes (#97).
- Added support for Illumina reads for spliced alignment correction (thanks to @rkpfeil).
- Added support for YAML files (thanks to @rkpfeil). Old options --bam_list and --fastq_list are still available, but deprecated since this version.
Transcript discovery and GTF processing:
- Fixed missing genes in extended GTF (#140, #147, #151, #175).
- Fixed strand detection and output of transcripts with . strand (#107).
- Added --report_canonical and --polya_requirement options that control the level of filtering of output transcripts based on canonical splice sites and the presence of poly-A tails (#128).
- Added check for input GTFs (#155).
- Extract CDS, other features and attributes from reference GTF to the output GTFs (#176).
- Reworked novel gene merging procedure (#164).
- Revamped algorithm for assigning reads to novel transcripts and their quantification (#127).
Read assignment and quantification:
- Optimized read-to-isoform assignment algorithm.
- Added gene_assignment_type attribute to read assignments.
- Fixed duplicated records in read_assignments.tsv (#168).
- Improved gene and transcript quantification. Only unique assignments are now used for transcript quantification. Added more options for quantification strategies (--gene_quantification and --transcript_quantification).
- Improved consistency between transcript_counts.tsv and transcript_model_counts.tsv (#137).
- Introduced mapping quality filtering: --min_mapq, --inconsistent_mapq_cutoff and --simple_alignments_mapq_cutoff (#110).
Minor fixes and improvements:
- Added --bam_tags option to import additional information from BAM files to read assignments output.
- Large output files are now gzipped by default, --no_gzip can be used to keep uncompressed output (#154).
- BAM stats are now printed to the log (#139).
- Various minor fixes and requests (#106, #141, #143, #146, #179).
Special acknowledgement to @almiheenko for testing and reviewing PRs, and to @alexandrutomescu for supporting the project.
- Fixed UnboundLocalError: local variable 'match' referenced before assignment error in SQANTI-like output.
- Fixed read to novel models assignment.
- Improved command line options for providing multiple files, added --prefix option.
- Additional checks for various unusual cases in input GTFs.
- Do not output empty files when no GTF is provided.
- Unspliced novel transcripts are not reported by default for ONT data, use --report_novel_unspliced to generate them.
- When multiple BAM/FASTQ files are provided via --bam/--fastq, they are treated as different replicas/samples of the same experiment; a single GTF and per-sample counts are generated automatically.
- 10-15 times lower RAM consumption with the same running time.
- ~5 times lower disk consumption for temporary files.
- --low_memory option has no effect (used by default); --high_memory mimics old behavior by storing alignments in RAM.
- Read assignment reports transcript start and end (TSS/TES) matches.
- --sqanti_output generates SQANTI-like output for novel vs reference transcripts.
- Resulting annotation contains exon ids.
- Supplementary gene attributes are copied from the reference annotation to the output annotations.
- Improved --resume and --force behaviour.
- --model_construction_strategy sensitive_pacbio is now more sensitive.
- Fixed strand detection that caused lower precision for novel transcripts.
- Fixed known transcript filtering that caused lower recall.
- Fixed duplicate transcript entries in the output annotation.
- Fixed duplicate canonical attribute in extended annotation.
- Fixed --resume option when relative paths were provided.
- Fixed error caused by introns of length 0 (strange corner case, but it does happen).
- Fixed error when using a read grouping file.
- Implement --resume option for resuming failed runs.
- Fix SQANTI-like output for raw reads.
- Fix read strand detection, improves transcript discovery as well.
- Simplify transcript naming, IDs of known transcripts are preserved in the output.
- More information about novel transcripts in GTF.
- Fix GTF attributes, thanks to @rsalz.
- Fix --check_canonical option.
- Annotation-free mode for de novo transcript discovery.
- Significant speed-up.
- Extended annotation (all reference + novel transcripts) is now part of the output.
- Intermediate BAM files have nicer names.
- Proper single-thread mode without thread pool usage.
- New options for controlling quantification strategies. Default behaviour is changed as well.
- New option --genedb_output for providing a separate folder for gene database in case the output directory is located on a shared disk.
- Possibility to provide read group tables in gzipped format.
- Fixed --check_canonical option.
- Improved running time for the read assignment step (noticeable only for genes with > 100 exons).
- Minor fixes and improvements in output files. Note that GTFs and some other files now have multiline headers.
- Parallel processing of transcript model construction phase.
- Minor improvements in quantification of reference transcripts.
- Fixed counts/TPM for novel transcript models.
- Fixed processing of BAM records without sequence data (e.g. secondary alignment).
- Fixed list index out of range bug in long read counter.
- Improved recall by introducing relative coverage cutoffs.
- More careful handling of transcript terminal positions.
- Fixed GTF to BED conversion.
- Completely new transcript discovery algorithm with significantly higher recall.
- Algorithm for read alignment correction.
- Support for technical replicas within a single sample.
- Significantly improved running time and RAM consumption;
- Annotation is now fed into minimap2;
- Extended output format.
- Support for GFF3 mRNA features.
- Support for BAM files with =/X in CIGAR strings;
- Fixed canonical splice site detection.
- Multi-threading;
- Intermediate results are saved to disk to enable quick restart via --read_assignments option;
- Significantly improved precision for novel transcript detection;
- Secondary alignments are now used by default;
- Fixed several bugs in inconsistency detection algorithm;
- Reworked polyA detection and reporting once again;
- Slightly modified read assignment output format;
- More informative GTF output;
- Removed --has_polya option, --polya_trimmed is now used as the opposite;
- Added --check_canonical option.
- Significantly reworked polyA detection and reporting;
- Improved detection of inconsistencies, added several new event types;
- Better recall and precision for read assignment algorithm;
- Fixed several bugs and flaws;
- Added script for counting simple stats for GTF files (srt/gtf_stats.py).
- Initial release.