Create tool for producing genomic regions (as a BED file) #7159

Open
LeeTL1220 opened this issue Mar 24, 2021 · 1 comment
Assignees
Labels
learn GATK Suitable for GATK beginners

Comments

@LeeTL1220
Contributor

LeeTL1220 commented Mar 24, 2021

Feature request

Tool(s) or class(es) involved

This is a request for a new tool
GencodeRegionsAsBED

Description

Given a GENCODE GTF, create a BED file containing the region of each gene. Each row is a gene.

Suggestion: This can be implemented as a FeatureWalker<GencodeGtfFeature>

Requirements

  • [P0] Union all basic, coding transcripts to determine region. "basic" is a tag, defined by GENCODE, that appears on transcripts in the gtf.
  • [P0] Include an option to emit one row per transcript instead of one row per gene. Include both the gene and the transcript ID in the output BED. Transcript entries should be sorted in natural order (in this case, natural order and alphabetical order will be the same).
  • [P0] Must support GENCODE v35 and above (through the latest at the time of the implementation)
  • [P0] Supports hg38 (note that this is implicit in the GENCODE version)
  • [P2] Include option that will create the BED file based on both basic and non-basic transcripts
  • [P2] Include option that will create the BED file based on both coding and non-coding transcripts
  • [P2] Include option to break out exon vs intron vs UTR, etc.
  • [P2] Support hg19/b37, which means supporting earlier versions of GENCODE.

[P0] = "Must have. Cannot close this issue without this feature or without filing another issue. This tool is not considered complete without this feature."
[P2] = "Not required. This tool can be considered complete without this feature. No need to ask permission to drop it. If it is NOT delivered, please mention what P2's were not delivered in the closing comment of this issue."
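The first P0 requirement (union all basic, coding transcripts per gene) could be sketched roughly as follows. This is only an illustration, not the GATK implementation: the `Transcript` and `GeneRegion` records and the `unionByGene` method are hypothetical names, "coding" is assumed to mean a `transcript_type` of `protein_coding`, genes are keyed by name for brevity, and coordinates are passed through unchanged because the issue does not say how they should be converted. The three MAPK1 rows reuse the coordinates from the example output in this issue and are assumed to be basic and coding; the fourth transcript is made up to show the filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GeneRegionSketch {
    // Hypothetical, simplified stand-in for a parsed GTF transcript record.
    record Transcript(String contig, int start, int end, String geneName,
                      String transcriptId, String transcriptType, Set<String> tags) {}

    // Hypothetical output row: one region per gene.
    record GeneRegion(String contig, int start, int end, String geneName) {
        String toBedLine() {
            // Coordinates are written as given; no coordinate-system conversion is done here.
            return contig + "\t" + start + "\t" + end + "\t" + geneName;
        }
    }

    // Assumption: "basic" is a tag and "coding" means transcript_type == protein_coding.
    static boolean isBasicCoding(Transcript t) {
        return t.tags().contains("basic") && "protein_coding".equals(t.transcriptType());
    }

    // Union of the retained transcripts per gene: smallest start, largest end.
    static List<GeneRegion> unionByGene(List<Transcript> transcripts) {
        Map<String, GeneRegion> byGene = new LinkedHashMap<>();
        for (Transcript t : transcripts) {
            if (!isBasicCoding(t)) {
                continue;
            }
            byGene.merge(t.geneName(),
                    new GeneRegion(t.contig(), t.start(), t.end(), t.geneName()),
                    (a, b) -> new GeneRegion(a.contig(),
                            Math.min(a.start(), b.start()),
                            Math.max(a.end(), b.end()),
                            a.geneName()));
        }
        return new ArrayList<>(byGene.values());
    }

    static List<Transcript> exampleTranscripts() {
        Set<String> basic = Set.of("basic");
        return List.of(
                new Transcript("chr22", 21759657, 21867645, "MAPK1", "ENST00000215832.11", "protein_coding", basic),
                new Transcript("chr22", 21769040, 21867680, "MAPK1", "ENST00000398822.7", "protein_coding", basic),
                new Transcript("chr22", 21769204, 21867440, "MAPK1", "ENST00000544786.1", "protein_coding", basic),
                // Made-up non-basic transcript; it must not widen the gene region.
                new Transcript("chr22", 21750000, 21870000, "MAPK1", "HYPOTHETICAL_TX.1", "protein_coding", Set.of()));
    }

    public static void main(String[] args) {
        for (GeneRegion region : unionByGene(exampleTranscripts())) {
            System.out.println(region.toBedLine());
        }
    }
}
```

With these inputs the sketch emits a single MAPK1 row spanning 21759657 to 21867680. A real tool would more likely key genes by their GENCODE gene ID, since gene names are not guaranteed to be unique.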

Example output

BED is tab-delimited...

...
chr22	21759657	21867680	MAPK1
...

With transcript option:

...
chr22	21759657	21867645	MAPK1,ENST00000215832.11
chr22	21769040	21867680	MAPK1,ENST00000398822.7
chr22	21769204	21867440	MAPK1,ENST00000544786.1
...

Note: The union of the transcript regions is reported when the transcript option is not present.
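For the transcript option, a minimal sketch of formatting and ordering the rows might look like the following. The `Row` record, the `toSortedBedLines` method, and the explicit `contigOrder` list are all hypothetical; the contig list stands in for whatever contig ordering a real tool would use, since sorting contig names as plain strings would put `chr10` before `chr2`. Ordering by start and then by transcript ID is an assumption about what "natural order" means here. The `GENE_X` row is made up purely to show the contig ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TranscriptRowsSketch {
    // Hypothetical row type for the per-transcript output.
    record Row(String contig, int start, int end, String geneName, String transcriptId) {
        String toBedLine() {
            // The name column carries "gene,transcript", as in the example output above.
            return contig + "\t" + start + "\t" + end + "\t" + geneName + "," + transcriptId;
        }
    }

    // Sorts by position of the contig in contigOrder, then start, then transcript ID.
    // Contigs missing from contigOrder get index -1 and therefore sort first.
    static List<String> toSortedBedLines(List<Row> rows, List<String> contigOrder) {
        Comparator<Row> order = Comparator
                .comparingInt((Row r) -> contigOrder.indexOf(r.contig()))
                .thenComparingInt(Row::start)
                .thenComparing(Row::transcriptId);
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(order);
        List<String> lines = new ArrayList<>();
        for (Row r : sorted) {
            lines.add(r.toBedLine());
        }
        return lines;
    }

    static List<Row> exampleRows() {
        return List.of(
                new Row("chr22", 21769204, 21867440, "MAPK1", "ENST00000544786.1"),
                new Row("chr10", 100, 200, "GENE_X", "TX_X.1"), // made-up placeholder row
                new Row("chr22", 21769040, 21867680, "MAPK1", "ENST00000398822.7"),
                new Row("chr22", 21759657, 21867645, "MAPK1", "ENST00000215832.11"));
    }

    public static void main(String[] args) {
        List<String> contigOrder = List.of("chr2", "chr10", "chr22");
        toSortedBedLines(exampleRows(), contigOrder).forEach(System.out::println);
    }
}
```

Run on the shuffled example rows, the placeholder `chr10` row comes first and the three MAPK1 rows follow in the same order as the example output above.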

LeeTL1220 added the "learn GATK" (Suitable for GATK beginners) label Mar 24, 2021
sanashah007 added a commit that referenced this issue Aug 29, 2024
#8942)

* Initial commit and basic code to read gtf

* add: code to write to bed & integration test

* fix: make getAllFeatures public and use the nesting of features to get to transcripts

* add: filtering transcripts by basic tag

* add: sorts by contig and start (need to fix - sorting lexicographically)

* fix: now sorts by contig then start & output is correct

* fix: make dictionary an arg

* add: comments + simplified CompareGtfInfo

* refactor: apply method
test: add separate tests for gene and transcript

* refactor: onTraversalSuccess and writeToBed

* add: more tests

* fix: test files in correct dir pt1. (files are too large)

* fix: test files in correct dir pt2.

* add: compareFiles and ground truth bed files

* fix: runGtfToBed assert

* add: comments to GtfToBed

* fix: error handling for different versions of gtf and dictionary

* fix: edited some bad conventions

* fix: remove spaces from input file fullName

* add: gtf file with MYT1L and MAPK1

* add: many transcripts unit test and refactoring

* add: tiebreaker sorting by id

* add: make sort by basic optional

* add: html doc comment

* fix: dictionary arg

* fix: add "Gencode" to description

* add: sample mouse gencode testing

* fix: Remove arg shortnames

* fix: rename and move CompareGtfInfo

* fix: kebab-case args

* fix: update html doc

* fix: use IntegrationTestSpec.assertEqualTextFiles()

* fix: remove unnecessary test of pik3ca

* fix: remove set functions in GtfInfo

* fix: style of comparator

* fix: style of comparator

* fix: use Files.newOutputStream() to write and logger for errors

* fix: use getBestAvailableSequenceDictionary()

* fix: use dataProvider for integration tests

* fix: better encapsulation

* fix: move mapk1.gtf to large dir

* fix: arg names

* fix: rename reference dict.

* fix: sequence-dictionary arg javadoc

* add: javadoc to GtfInfo

* add: dictionary exception and corresponding test

* add: test with fasta file as reference arg

* add: javadoc for fasta file

* fix: javadoc and onTraversalStart exception
@kockan

kockan commented Aug 29, 2024

Resolved by #8942. Unless @LeeTL1220 or @droazen has any objections, I will close this with the note that all P0 requirements are met by that PR. The following P2 requirements are not included as of now, but if they are deemed important I can keep this open:

  • [P2] Include option that will create the BED file based on both coding and non-coding transcripts
  • [P2] Include option to break out exon vs intron vs UTR, etc.

Thanks to @sanashah007 for all the work!

4 participants