Create tool for producing genomic regions (as a BED file) #7159

LeeTL1220 · 2021-03-24T15:03:26Z

Feature request

Tool(s) or class(es) involved

This is a request for a new tool
GencodeRegionsAsBED

Description

Given a GENCODE GTF, create a BED file containing the region of each gene. Each row is a gene.

Suggestion: This can be implemented as a FeatureWalker<GencodeGtfFeature>

Requirements

[P0] Union all basic, coding transcripts to determine region. "basic" is a tag, defined by GENCODE, that appears on transcripts in the gtf.
[P0] Include an option to emit one row per transcript instead of one row per gene. Please include the gene and transcript ID in the output BED. Transcript entries should be sorted in natural order (in this case, natural order and alphabetical order will be the same).
[P0] Must support GENCODE v35 and above (through the latest at the time of the implementation)
[P0] Supports hg38 (note that this is implicit in the GENCODE version)
[P2] Include option that will create the BED file based on both basic and non-basic transcripts
[P2] Include option that will create the BED file based on both coding and non-coding transcripts
[P2] Include option to break out exon vs intron vs UTR, etc.
[P2] Support hg19/b37, which means supporting earlier versions of GENCODE.

[P0] = "Must have. Cannot close this issue without this feature or without filing another issue. This tool is not considered complete without this feature."
[P2] = "Not required. This tool can be considered complete without this feature. No need to ask permission to drop it. If it is NOT delivered, please mention what P2's were not delivered in the closing comment of this issue."
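To make the first P0 requirement concrete, here is a minimal Python sketch of the union step. It is not the tool's implementation (the request is for a Java FeatureWalker); it only illustrates the filtering and merging logic. Assumptions that are not in the issue: "coding" is read as `transcript_type "protein_coding"`, genes are keyed by `gene_name`, and GTF coordinates are passed through unchanged because the issue does not say whether start should be shifted to BED's 0-based convention. The names `parse_attributes` and `gene_regions` are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_name "MAPK1"; or level 2;
_ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_attributes(field):
    """Return attribute name -> list of values (keys like `tag` can repeat)."""
    attrs = defaultdict(list)
    for key, quoted, bare in _ATTR_RE.findall(field):
        attrs[key].append(quoted if quoted else bare)
    return attrs


def gene_regions(gtf_lines, basic_only=True, coding_only=True):
    """Union the selected transcripts of each gene into (contig, start, end).

    Coordinates are copied from the GTF as-is (assumption, see lead-in).
    """
    regions = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # only transcript records contribute to the union
        attrs = parse_attributes(cols[8])
        if basic_only and "basic" not in attrs["tag"]:
            continue
        # Assumption: "coding" means transcript_type is protein_coding.
        if coding_only and "protein_coding" not in attrs["transcript_type"]:
            continue
        if not attrs["gene_name"]:
            continue
        gene = attrs["gene_name"][0]
        contig, start, end = cols[0], int(cols[3]), int(cols[4])
        if gene in regions:
            _, prev_start, prev_end = regions[gene]
            start, end = min(prev_start, start), max(prev_end, end)
        regions[gene] = (contig, start, end)
    return regions
```

The `basic_only` and `coding_only` switches correspond to the two P2 options for including non-basic and non-coding transcripts.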

Example output

BED is tab-delimited...

...
chr22	21759657	21867680	MAPK1
...

With transcript option:

...
chr22	21759657	21867645	MAPK1,ENST00000215832.11
chr22	21769040	21867680	MAPK1,ENST00000398822.7
chr22	21769204	21867440	MAPK1,ENST00000544786.1
...

Note: The union of the transcript regions is reported when the transcript option is not present.
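As a rough illustration of the two output modes, the sketch below builds the rows from the MAPK1 example above. It is a hedged sketch, not the tool itself: the `Transcript` tuple, `bed_rows`, and the `contig_order` parameter are hypothetical names, and the sort key (contig, then start, then name) is an assumption about what "natural order" should mean here.

```python
from collections import namedtuple

# Hypothetical record type for one already-filtered transcript.
Transcript = namedtuple("Transcript", "contig start end gene transcript_id")

# The three MAPK1 transcripts from the example output.
example = [
    Transcript("chr22", 21769204, 21867440, "MAPK1", "ENST00000544786.1"),
    Transcript("chr22", 21759657, 21867645, "MAPK1", "ENST00000215832.11"),
    Transcript("chr22", 21769040, 21867680, "MAPK1", "ENST00000398822.7"),
]


def bed_rows(transcripts, per_transcript=False, contig_order=None):
    """Return tab-delimited BED lines, one per gene or one per transcript."""
    if per_transcript:
        records = [
            (t.contig, t.start, t.end, f"{t.gene},{t.transcript_id}")
            for t in transcripts
        ]
    else:
        # Gene mode: report the union (min start, max end) of the transcripts.
        merged = {}
        for t in transcripts:
            key = (t.contig, t.gene)
            start, end = merged.get(key, (t.start, t.end))
            merged[key] = (min(start, t.start), max(end, t.end))
        records = [(c, s, e, g) for (c, g), (s, e) in merged.items()]
    # Assumed ordering: contig rank (if given), then start, then name.
    rank = {c: i for i, c in enumerate(contig_order or [])}
    records.sort(key=lambda r: (rank.get(r[0], len(rank)), r[0], r[1], r[3]))
    return ["\t".join(map(str, r)) for r in records]


if __name__ == "__main__":
    print("\n".join(bed_rows(example)))
    print("\n".join(bed_rows(example, per_transcript=True)))
```

With the example records, gene mode yields the single MAPK1 row shown earlier, and transcript mode yields the three transcript rows in the same order as the example.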


#8942)
* Initial commit and basic code to read gtf
* add: code to write to bed & integration test
* fix: make getAllFeatures public and use the nesting of features to get to transcripts
* add: filtering transcripts by basic tag
* add: sorts by contig and start (need to fix - sorting lexicographically)
* fix: now sorts by contig then start & output is correct
* fix: make dictionary an arg
* add: comments + simplified CompareGtfInfo
* refactor: apply method test: add separate tests for gene and transcript
* refactor: onTraversalSuccess and writeToBed
* add: more tests
* fix: test files in correct dir pt1. (files are too large)
* fix: test files in correct dir pt2.
* add: compareFiles and ground truth bed files
* fix: runGtfToBed assert
* add: comments to GtfToBed
* fix: error handling for different versions of gtf and dictionary
* fix: edited some bad conventions
* fix: remove spaces from input file fullName
* add: gtf file with MYT1L and MAPK1
* add: many transcripts unit test and refactoring
* add: tiebreaker sorting by id
* add: make sort by basic optional
* add: html doc comment
* fix: dictionary arg
* fix: add "Gencode" to description
* add: sample mouse gencode testing
* fix: Remove arg shortnames
* fix: rename and move CompareGtfInfo
* fix: kebab-case args
* fix: update html doc
* fix: use IntegrationTestSpec.assertEqualTextFiles()
* fix: remove unnecessary test of pik3ca
* fix: remove set functions in GtfInfo
* fix: style of comparator
* fix: style of comparator
* fix: use Files.newOutputStream() to write and logger for errors
* fix: use getBestAvailableSequenceDictionary()
* fix: use dataProvider for integration tests
* fix: better encapsulation
* fix: move mapk1.gtf to large dir
* fix: arg names
* fix: rename reference dict.
* fix: sequence-dictionary arg javadoc
* add: javadoc to GtfInfo
* add: dictionary exception and corresponding test
* add: test with fasta file as reference arg
* add: javadoc for fasta file
* fix: javadoc and onTraversalStart exception

kockan · 2024-08-29T20:21:41Z

Resolved by #8942. Unless @LeeTL1220 or @droazen has any objections, I will close this with the note that all P0 requirements are met by the relevant PR. The following P2 requirements are not included as of now, but if they are deemed important I can keep this open:

[P2] Include option that will create the BED file based on both coding and non-coding transcripts
[P2] Include option to break out exon vs intron vs UTR, etc.

Thanks to @sanashah007 for all the work!

LeeTL1220 added the learn GATK Suitable for GATK beginners label Mar 24, 2021

LeeTL1220 assigned kishorikonwar Mar 24, 2021

LeeTL1220 mentioned this issue Mar 24, 2021

New Tool: Reference Comparator #6837

Open

droazen assigned LeeTL1220 Apr 12, 2021

kockan assigned sanashah007 Aug 2, 2024
