Change polyA-based and canonical splice site transcript filtration #134

Merged
merged 10 commits into from
Apr 9, 2024
32 changes: 25 additions & 7 deletions README.md
@@ -105,6 +105,9 @@ IsoQuant supports all kinds of long RNA data:

Reads must be provided in FASTQ or FASTA format (can be gzipped). If you have already aligned your reads to the reference genome, simply provide sorted and indexed BAM files.

IsoQuant expects reads to contain polyA tails. For more reliable transcript model construction, do not trim polyA tails.


<a name="sec1.2"></a>
## Supported reference data

@@ -519,6 +522,26 @@ The main explanation is that some aligners report a lot of false unspliced alignments
for ONT reads.


`--report_canonical`
Strategy for reporting novel transcripts based on canonical splice sites, should be one of:

* `auto` - automatic selection based on the data type and model construction strategy (default);
* `only_canonical` - report novel transcripts that contain only canonical splice sites;
* `only_stranded` - report novel transcripts for which the strand can be unambiguously derived using splice sites and
the presence of a polyA tail, allowing some splice sites to be non-canonical;
* `all` - report all transcript models regardless of their splice sites.
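For intuition, strand derivation from canonical splice sites can be sketched as follows. This is an illustrative sketch, not IsoQuant's actual code: canonical introns read `GT..AG` on the forward strand, which appears as `CT..AC` when the transcript lies on the reverse strand.

```python
def infer_strand(intron_dinucleotides):
    """Infer transcript strand from intron boundary dinucleotides.

    intron_dinucleotides: list of (donor, acceptor) pairs as read
    from the forward genomic strand (hypothetical input shape).
    """
    votes = {"+": 0, "-": 0}
    for donor, acceptor in intron_dinucleotides:
        if (donor, acceptor) == ("GT", "AG"):
            votes["+"] += 1  # canonical forward-strand intron
        elif (donor, acceptor) == ("CT", "AC"):
            votes["-"] += 1  # canonical intron seen in reverse complement
    if votes["+"] > votes["-"]:
        return "+"
    if votes["-"] > votes["+"]:
        return "-"
    return "."  # ambiguous: strand cannot be derived from splice sites alone
```

Under `only_stranded`, a transcript whose splice sites leave the strand ambiguous could still be reported if a polyA tail resolves it; under `only_canonical`, every intron must vote canonically.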


`--polya_requirement` Strategy for using polyA tails during transcript model construction, should be one of:

* `auto` - default behaviour: polyA tails are required if at least 70% of the reads have a polyA tail;
polyA tails are always required for 1/2-exon transcripts when using ONT data (this is caused by an elevated number of false 1/2-exonic alignments reported by minimap2);
* `never` - polyA tails are never required; use this option **at your own risk** as it may noticeably increase false discovery rate, especially for ONT data;
* `always` - reported transcripts are always required to have polyA support in the reads.

Note that polyA tails are always required for reporting novel unspliced isoforms.
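A minimal sketch of the `auto` rule described above (hypothetical helper and enum values, not the actual IsoQuant implementation):

```python
from enum import Enum, unique

@unique
class PolyAUsageStrategies(Enum):
    auto = 1
    never = 2
    always = 3

def polya_required(strategy, polya_read_fraction, data_type, exon_count):
    """Resolve whether a transcript model needs polyA support in the reads."""
    if strategy is PolyAUsageStrategies.always:
        return True
    if strategy is PolyAUsageStrategies.never:
        return False
    # auto: require tails when >= 70% of reads carry one; always require
    # them for 1/2-exon transcripts on ONT data, where false 1/2-exonic
    # alignments are common
    if data_type == "nanopore" and exon_count <= 2:
        return True
    return polya_read_fraction >= 0.7
```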



### Hidden options
<a name="hidden"></a>
@@ -528,9 +551,6 @@ We recommend _not_ modifying these options unless you are clearly aware of their
`--no_secondary`
Ignore secondary alignments.

`--report_unstranded`
Report transcripts for which the strand cannot be detected using canonical splice sites.

`--aligner`
Force to use this alignment method, can be `starlong` or `minimap2`; `minimap2` is currently used as default. Make sure the specified aligner is in the `$PATH` variable.

@@ -548,11 +568,9 @@ We recommend _not_ modifying these options unless you are clearly aware of their
for storing the annotation database, because the SQLite database cannot be created on shared disks.
The folder will be created automatically.

`--low_memory`
Deprecated, default behaviour since 3.2.

`--high_memory`
Cache read alignments instead of making several passes over a BAM file; noticeably increases RAM usage.
Cache read alignments instead of making several passes over a BAM file; noticeably increases RAM usage,
but may improve running time when disk I/O is relatively slow.

`--min_mapq`
Filters out all alignments with MAPQ less than this value (will also filter all secondary alignments, as they typically have MAPQ = 0).
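The described option amounts to a simple cutoff; a sketch under an assumed data shape (not IsoQuant's actual code):

```python
def filter_by_mapq(alignments, min_mapq):
    """Keep alignments whose MAPQ meets the cutoff.

    alignments: list of (read_name, mapq, is_secondary) tuples
    (hypothetical shape for illustration).
    """
    # Secondary alignments usually carry MAPQ = 0, so any min_mapq > 0
    # removes them as a side effect of the same comparison.
    return [a for a in alignments if a[1] >= min_mapq]
```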
86 changes: 54 additions & 32 deletions isoquant.py
@@ -29,7 +29,8 @@
NANOPORE_DATA,
DataSetReadMapper
)
from src.dataset_processor import DatasetProcessor
from src.dataset_processor import DatasetProcessor, PolyAUsageStrategies
from src.graph_based_model_construction import StrandnessReportingLevel
from src.long_read_assigner import AmbiguityResolvingMethod
from src.long_read_counter import COUNTING_STRATEGIES
from src.input_data_storage import InputDataStorage
@@ -48,10 +49,10 @@ def bool_str(s):

def parse_args(cmd_args=None, namespace=None):
parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter)
ref_args_group = parser.add_argument_group('Reference data:')
input_args_group = parser.add_argument_group('Input data:')
output_args_group = parser.add_argument_group('Output naming:')
pipeline_args_group = parser.add_argument_group('Pipeline options:')
ref_args_group = parser.add_argument_group('Reference data')
input_args_group = parser.add_argument_group('Input data')
output_args_group = parser.add_argument_group('Output naming')
pipeline_args_group = parser.add_argument_group('Pipeline options')

other_options = parser.add_argument_group("Additional options:")
show_full_help = '--full_help' in cmd_args
@@ -150,16 +151,20 @@ def add_hidden_option(*args, **kwargs): # show command only with --full-help
help="gene quantification strategy", type=str, default="with_inconsistent")
add_additional_option("--report_novel_unspliced", "-u", type=bool_str,
help="report novel monoexonic transcripts (true/false), "
"default: False for ONT, True for other data types")
"default: false for ONT, true for other data types")
add_additional_option("--report_canonical", type=str, choices=[e.name for e in StrandnessReportingLevel],
help="reporting level for novel transcripts based on canonical splice sites;"
" default: " + StrandnessReportingLevel.auto.name,
default=StrandnessReportingLevel.auto.name)
add_additional_option("--polya_requirement", type=str, choices=[e.name for e in PolyAUsageStrategies],
help="require polyA tails to be present when reporting transcripts; "
"default: auto (requires polyA only when polyA percentage is >= 70%%)",
default=PolyAUsageStrategies.auto.name)
# OUTPUT PROPERTIES
pipeline_args_group.add_argument("--threads", "-t", help="number of threads to use", type=int,
default="16")
pipeline_args_group.add_argument('--check_canonical', action='store_true', default=False,
help="report whether splice junctions are canonical")
add_additional_option_to_group(pipeline_args_group, "--report_unstranded",
help="report transcripts for which the strand cannot be detected "
"using canonical splice sites",
action='store_true', default=False)
pipeline_args_group.add_argument("--sqanti_output", help="produce SQANTI-like TSV output",
action='store_true', default=False)
pipeline_args_group.add_argument("--count_exons", help="perform exon and intron counting",
@@ -180,8 +185,6 @@ def add_hidden_option(*args, **kwargs): # show command only with --full-help
help="align reads to reference without running further analysis")

# ADDITIONAL
add_additional_option("--low_memory", help="decrease RAM consumption (deprecated, set by default)",
action='store_true', default=True)
add_additional_option("--high_memory", help="increase RAM consumption (store alignment and the genome in RAM)",
action='store_true', default=False)
add_additional_option("--no_junc_bed", action="store_true", default=False,
Expand Down Expand Up @@ -234,8 +237,12 @@ def add_hidden_option(*args, **kwargs): # show command only with --full-help
type=str, required=True)
resume_parser.add_argument('--debug', action='store_true', default=argparse.SUPPRESS,
help='Debug log output.')
resume_parser.add_argument("--threads", "-t", help="number of threads to use", type=int, default=argparse.SUPPRESS)
resume_parser.add_argument("--low_memory", help="decrease RAM consumption (deprecated, set by default)",
resume_parser.add_argument("--threads", "-t", help="number of threads to use",
type=int, default=argparse.SUPPRESS)
resume_parser.add_argument("--high_memory",
help="increase RAM consumption (store alignment and the genome in RAM)",
action='store_true', default=False)
resume_parser.add_argument("--keep_tmp", help="do not remove temporary files in the end",
action='store_true', default=argparse.SUPPRESS)
resume_parser.add_argument("--keep_tmp", help="do not remove temporary files in the end", action='store_true',
default=argparse.SUPPRESS)
@@ -271,11 +278,12 @@ def check_and_load_args(args, parser):
logger.warning("Output folder already contains a previous run, will be overwritten.")
else:
logger.warning("Output folder already contains a previous run, some files may be overwritten. "
"Use --resume to resume a failed run. Use --force to avoid this message. "
"Press Ctrl+C to interrupt the run now.")
"Use --resume to resume a failed run. Use --force to avoid this message.")
logger.warning("Press Ctrl+C to interrupt the run now.")
delay = 9
for i in range(delay):
sys.stdout.write("Resuming the run in %d seconds\r" % (delay - i))
countdown = delay - i
sys.stdout.write("Resuming the run in %d second%s\r" % (countdown, "s" if countdown > 1 else ""))
time.sleep(1)
logger.info("Overwriting the previous run")
time.sleep(1)
@@ -288,9 +296,6 @@ def check_and_load_args(args, parser):
elif not os.path.exists(args.genedb_output):
os.makedirs(args.genedb_output)

if args.high_memory:
args.low_memory = False

if not check_input_params(args):
parser.print_usage()
exit(-1)
@@ -501,7 +506,7 @@ def set_data_dependent_options(args):
args.splice_correction_strategy = splice_correction_strategies[args.data_type]

args.resolve_ambiguous = 'monoexon_and_fsm' if args.fl_data else 'default'
args.needs_polya_for_construction = False
args.requires_polya_for_construction = False
if args.read_group is None and args.input_data.has_replicas():
args.read_group = "file_name"
args.use_technical_replicas = args.read_group == "file_name"
@@ -591,16 +596,26 @@ def set_model_construction_options(args):
'min_known_count', 'min_nonfl_count',
'min_novel_count', 'min_novel_count_rel',
'min_mono_count_rel', 'singleton_adjacent_cov',
'fl_only', 'novel_monoexonic'))
'fl_only', 'novel_monoexonic',
'require_monointronic_polya', 'require_monoexonic_polya',
'report_canonical'))
strategies = {
'reliable': ModelConstructionStrategy(2, 0.5, 20, 5, 0.05, 1, 0.1, 0.1, 2, 4, 8, 0.05, 0.05, 50, True, False),
'default_pacbio': ModelConstructionStrategy(1, 0.5, 10, 2, 0.02, 1, 0.05, 0.05, 1, 2, 2, 0.02, 0.005, 100, False, True),
'sensitive_pacbio':ModelConstructionStrategy(1, 0.5, 5, 2, 0.005, 1, 0.01, 0.02, 1, 2, 2, 0.005, 0.001, 100, False, True),
'default_ont': ModelConstructionStrategy(1, 0.5, 20, 3, 0.02, 1, 0.05, 0.05, 1, 3, 3, 0.02, 0.02, 10, False, False),
'sensitive_ont': ModelConstructionStrategy(1, 0.5, 20, 3, 0.005, 1, 0.01, 0.02, 1, 2, 3, 0.005, 0.005, 10, False, True),
'fl_pacbio': ModelConstructionStrategy(1, 0.5, 10, 2, 0.02, 1, 0.05, 0.01, 1, 2, 3, 0.02, 0.005, 100, True, True),
'all': ModelConstructionStrategy(0, 0.3, 5, 1, 0.002, 1, 0.01, 0.01, 1, 1, 1, 0.002, 0.001, 500, False, True),
'assembly': ModelConstructionStrategy(0, 0.3, 5, 1, 0.05, 1, 0.01, 0.02, 1, 1, 1, 0.05, 0.01, 50, False, True)
'reliable': ModelConstructionStrategy(2, 0.5, 20, 5, 0.05, 1, 0.1, 0.1, 2, 4, 8, 0.05, 0.05, 50,
True, False, True, True, StrandnessReportingLevel.only_canonical),
'default_pacbio': ModelConstructionStrategy(1, 0.5, 10, 2, 0.02, 1, 0.05, 0.05, 1, 2, 2, 0.02, 0.005, 100,
False, True, False, True, StrandnessReportingLevel.only_canonical),
'sensitive_pacbio':ModelConstructionStrategy(1, 0.5, 5, 2, 0.005, 1, 0.01, 0.02, 1, 2, 2, 0.005, 0.001, 100,
False, True, False, False, StrandnessReportingLevel.only_stranded),
'default_ont': ModelConstructionStrategy(1, 0.5, 20, 3, 0.02, 1, 0.05, 0.05, 1, 3, 3, 0.02, 0.02, 10,
False, False, True, True, StrandnessReportingLevel.only_canonical),
'sensitive_ont': ModelConstructionStrategy(1, 0.5, 20, 3, 0.005, 1, 0.01, 0.02, 1, 2, 3, 0.005, 0.005, 10,
False, True, False, False, StrandnessReportingLevel.only_stranded),
'fl_pacbio': ModelConstructionStrategy(1, 0.5, 10, 2, 0.02, 1, 0.05, 0.01, 1, 2, 3, 0.02, 0.005, 100,
True, True, False, False, StrandnessReportingLevel.only_canonical),
'all': ModelConstructionStrategy(0, 0.3, 5, 1, 0.002, 1, 0.01, 0.01, 1, 1, 1, 0.002, 0.001, 500,
False, True, False, False, StrandnessReportingLevel.all),
'assembly': ModelConstructionStrategy(0, 0.3, 5, 1, 0.05, 1, 0.01, 0.02, 1, 1, 1, 0.05, 0.01, 50,
False, True, False, False, StrandnessReportingLevel.only_stranded)
}
strategy = strategies[args.model_construction_strategy]

@@ -633,6 +648,13 @@ def set_model_construction_options(args):
logger.info("Novel unspliced transcripts will not be reported, "
"set --report_novel_unspliced true to discover them")

args.require_monointronic_polya = strategy.require_monointronic_polya
args.require_monoexonic_polya = strategy.require_monoexonic_polya
args.polya_requirement_strategy = PolyAUsageStrategies[args.polya_requirement]
args.report_canonical_strategy = StrandnessReportingLevel[args.report_canonical]
if args.report_canonical_strategy == StrandnessReportingLevel.auto:
args.report_canonical_strategy = strategy.report_canonical
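The `auto` fallback wired up above follows a common pattern: a user-facing `auto` value defers to a per-strategy preset. An isolated sketch of that pattern (simplified enum and helper, not the full IsoQuant code):

```python
from enum import Enum, unique

@unique
class StrandnessReportingLevel(Enum):
    auto = 0
    only_canonical = 1
    only_stranded = 2
    all = 3

def resolve_report_level(user_choice, strategy_default):
    """Map the argparse string back to an enum member; 'auto' defers
    to the model construction strategy's preset."""
    level = StrandnessReportingLevel[user_choice]
    if level is StrandnessReportingLevel.auto:
        return strategy_default
    return level
```

Indexing the enum class by name (`StrandnessReportingLevel[...]`) is the inverse of the `choices=[e.name for e in ...]` idiom used when declaring the option.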


def set_configs_directory(args):
config_dir = os.path.join(os.environ['HOME'], '.config', 'IsoQuant')
@@ -656,8 +678,6 @@ def set_additional_params(args):
set_splice_correction_options(args)

args.print_additional_info = True
args.no_polya = False

args.indel_near_splice_site_dist = 10
args.upstream_region_len = 20

@@ -673,6 +693,8 @@ def set_additional_params(args):
args.needs_reference = False

args.simple_models_mapq_cutoff = 30
args.polya_percentage_threshold = 0.7
args.low_polya_percentage_threshold = 0.1


def run_pipeline(args):
4 changes: 2 additions & 2 deletions src/alignment_processor.py
@@ -229,7 +229,7 @@ def __init__(self, chr_id, bam_pairs, params, illumina_bam, genedb=None, chr_rec

self.bam_merger = BAMOnlineMerger(self.bam_pairs, self.chr_id, 0,
self.bam_pairs[0][0].get_reference_length(self.chr_id),
multiple_iterators=self.params.low_memory)
multiple_iterators=not self.params.high_memory)
self.strand_detector = StrandDetector(self.chr_record)
self.read_groupper = read_groupper
self.polya_finder = PolyAFinder(self.params.polya_window, self.params.polya_fraction)
@@ -238,7 +238,7 @@ def __init__(self, chr_id, bam_pairs, params, illumina_bam, genedb=None, chr_rec

def process(self):
current_region = None
alignment_storage = BAMAlignmentStorage(self.bam_merger) if self.params.low_memory else InMemoryAlignmentStorage()
alignment_storage = BAMAlignmentStorage(self.bam_merger) if not self.params.high_memory else InMemoryAlignmentStorage()

for bam_index, alignment in self.bam_merger.get():
if alignment_storage.alignment_is_not_adjacent(alignment):