Allow LocatableXsvFuncotationFactory to read gzipped files #8363

Merged Jul 10, 2023 (4 commits)


@bbimber bbimber commented Jun 14, 2023

No description provided.

@bbimber
Contributor Author

bbimber commented Jun 14, 2023

Hello,

@jonn-smith and @lbergelson: I'd like to make a custom data source using tabular data as the input. These data should be matched on position, without considering allele, using LocatableXSV as the type. The docs include an example of a TSV source using Oreganno (https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial); however, I ran into two issues:

  • A TSV needs to be indexed to be read. It's not clear how to generate an idx file from a non-bgzipped TSV. GATK's IndexFeatureFile does not recognize basic TSVs as input files. Perhaps I'm missing something.

  • It would be possible to bgzip the TSV and make a tabix index. The problem is that while most GATK code seamlessly handles uncompressed or gzipped inputs, LocatableXsvFuncotationFactory expects uncompressed input. This PR makes a minor change to the file reading code that allows gzipped TSV inputs.
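
For readers unfamiliar with the idea in the second bullet, here is a minimal sketch of opening a text file the same way whether or not it is compressed. It is not the code changed in this PR: GzipAwareReader and its open method are hypothetical names, and the sketch uses only the Java standard library rather than GATK or htsjdk utilities. It assumes the file is either plain text or gzip-compatible (bgzip output is a series of gzip members, which java.util.zip.GZIPInputStream can read sequentially).

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper, not part of GATK: reads plain or gzip-compressed text transparently. */
final class GzipAwareReader {
    private GzipAwareReader() {}

    /** Opens the file, decompressing only when it starts with the gzip magic bytes 0x1f 0x8b. */
    static BufferedReader open(final Path path) throws IOException {
        final InputStream in = new BufferedInputStream(Files.newInputStream(path));
        try {
            // Peek at the first two bytes, then rewind so the real reader sees the whole file.
            in.mark(2);
            final int b0 = in.read();
            final int b1 = in.read();
            in.reset();
            final boolean gzipped = (b0 == 0x1f && b1 == 0x8b);
            final InputStream source = gzipped ? new GZIPInputStream(in) : in;
            return new BufferedReader(new InputStreamReader(source, StandardCharsets.UTF_8));
        } catch (final IOException | RuntimeException e) {
            // Do not leak the file handle if sniffing or the gzip header check fails.
            in.close();
            throw e;
        }
    }
}
```

Sniffing the magic bytes rather than trusting the .gz extension means a mislabeled file is still read correctly. This sketch only covers sequential reading; random access through a tabix index needs a block-aware reader.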

Below is an example input. You can bgzip this and index using:

tabix -f -s 1 -b 2 -e 3 -S 1 textSource.txt.gz

With this PR, I think funcotator will now support gzipped LocatableXSV sources. Would it be possible to add this?

textSource.txt

@jonn-smith
Collaborator

@bbimber This looks like a good change, but I don't think it'll solve the problem.

The XsvLocatableTableCodec works differently than other codecs. It was essentially created for Funcotator datasources, and it expects to be given a .config file rather than the XSV file itself. For example, if you wanted to index the Oreganno data source file, you'd need to first create a funcotator configuration file adjacent to it, and then use IndexFeatureFile to index the config file rather than the tsv.

This is not a good design (my fault), but it's how the tool operates as of right now.

@bbimber
Contributor Author

bbimber commented Jun 14, 2023

@jonn-smith, I did see XsvLocatableTableCodec and the .config file path, but this does not appear to work. To be clear, the command is something like:

gatk IndexFeatureFile -I ./hg19/testTextSource.config

IndexFeatureFile does identify the correct codec; however, it then calls:

return IndexFactory.createDynamicIndex(featurePath.toPath(), codec, IndexFactory.IndexBalanceApproach.FOR_SEEK_TIME);

where featurePath is the config file. This calls IndexFactory to open a lineReader on the config file (not the backing data source): https://github.com/samtools/htsjdk/blob/6d3fc7bc1f613ecfce1c22d368f3ae17cb86823d/src/main/java/htsjdk/tribble/index/IndexFactory.java#L598.

This then fails during XsvLocatableTableCodec.readActualHeader(), since the codec is reading the config file rather than the TXT data file.

@bbimber
Contributor Author

bbimber commented Jun 14, 2023

@jonn-smith also, I've tried this locally and I'm pretty sure the change I propose here allows Funcotator to use txt.gz LocatableXSV sources, where that file is simply indexed separately with tabix.

@bbimber
Contributor Author

bbimber commented Jun 21, 2023

@jonn-smith: any thoughts on the PR? I could try to put in a test, but this does allow Funcotator to read gzipped TSV files, which is rather useful when you're dealing with large inputs.

Contributor

@droazen droazen left a comment

This change seems fine to me @bbimber, provided you add an integration test demonstrating the new gzip support in Funcotator.

@bbimber
Contributor Author

bbimber commented Jun 22, 2023

ok

@bbimber
Contributor Author

bbimber commented Jun 22, 2023

@droazen: test added (technically a unit test), but I don't think I'm able to kick off the test suite.

@bbimber
Contributor Author

bbimber commented Jun 22, 2023

I added one unrelated bugfix. FuncotatorUtils.createReferenceSnippet tries to expand the reference window. When doing this, it should never allow a start less than 1. The last commit addresses that.

Note: I did not see an easy way for createReferenceSnippet() to identify the length of the contig (such as access to the SequenceDictionary), but it would in theory be useful to also check contig size and not exceed it.
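
To illustrate the clamping described above, here is a minimal sketch. It is not the actual FuncotatorUtils.createReferenceSnippet code: ReferenceWindow, padded, and the contigLength parameter are hypothetical, and the convention that a non-positive contigLength means "length unknown" is an assumption made only for this example. Coordinates are treated as 1-based and inclusive.

```java
/** Hypothetical value class illustrating window clamping; not the FuncotatorUtils implementation. */
final class ReferenceWindow {
    final int start;
    final int end;

    private ReferenceWindow(final int start, final int end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Expands [start, end] by padding bases on each side.
     * The start is never allowed below 1. The end is capped at contigLength
     * only when contigLength is positive (assumed convention: <= 0 means unknown).
     */
    static ReferenceWindow padded(final int start, final int end, final int padding, final int contigLength) {
        if (start < 1 || end < start || padding < 0) {
            throw new IllegalArgumentException(
                    "Invalid input. start:" + start + " end:" + end + " padding:" + padding);
        }
        if (contigLength > 0 && end > contigLength) {
            throw new IllegalArgumentException("end:" + end + " exceeds contig length:" + contigLength);
        }
        final int paddedStart = Math.max(1, start - padding);
        final int paddedEnd = contigLength > 0
                ? Math.min(contigLength, end + padding)
                : end + padding;
        return new ReferenceWindow(paddedStart, paddedEnd);
    }
}
```

With this sketch, padding position 10 by 10 bases gives the window 1 to 20 rather than a window starting at 0, which is the kind of interval that SimpleInterval rejects in the log that follows.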

@droazen or @jonn-smith: it would be helpful if you could approve the test run

22 Jun 2023 14:54:27,152 DEBUG: 	java.lang.IllegalArgumentException: Invalid interval. Contig:MT start:0 end:20
22 Jun 2023 14:54:27,154 DEBUG: 		at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:804)
22 Jun 2023 14:54:27,155 DEBUG: 		at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:59)
22 Jun 2023 14:54:27,156 DEBUG: 		at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:35)
22 Jun 2023 14:54:27,158 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.FuncotatorUtils.createReferenceSnippet(FuncotatorUtils.java:1461)
22 Jun 2023 14:54:27,159 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createIgrFuncotation(GencodeFuncotationFactory.java:2481)
22 Jun 2023 14:54:27,160 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createIgrFuncotations(GencodeFuncotationFactory.java:2407)
22 Jun 2023 14:54:27,162 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createDefaultFuncotationsOnVariant(GencodeFuncotationFactory.java:499)
22 Jun 2023 14:54:27,163 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.createFuncotations(DataSourceFuncotationFactory.java:217)
22 Jun 2023 14:54:27,164 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.DataSourceFuncotationFactory.createFuncotations(DataSourceFuncotationFactory.java:182)
22 Jun 2023 14:54:27,166 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.FuncotatorEngine.lambda$createFuncotationMapForVariant$0(FuncotatorEngine.java:152)
22 Jun 2023 14:54:27,167 DEBUG: 		at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
22 Jun 2023 14:54:27,168 DEBUG: 		at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
22 Jun 2023 14:54:27,170 DEBUG: 		at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
22 Jun 2023 14:54:27,171 DEBUG: 		at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
22 Jun 2023 14:54:27,172 DEBUG: 		at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
22 Jun 2023 14:54:27,174 DEBUG: 		at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
22 Jun 2023 14:54:27,175 DEBUG: 		at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
22 Jun 2023 14:54:27,177 DEBUG: 		at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
22 Jun 2023 14:54:27,178 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.FuncotatorEngine.createFuncotationMapForVariant(FuncotatorEngine.java:162)
22 Jun 2023 14:54:27,180 DEBUG: 		at com.github.discvrseq.walkers.ExtendedFuncotator.enqueueAndHandleVariant(ExtendedFuncotator.java:209)
22 Jun 2023 14:54:27,181 DEBUG: 		at org.broadinstitute.hellbender.tools.funcotator.Funcotator.apply(Funcotator.java:878)

@bbimber
Contributor Author

bbimber commented Jun 26, 2023

@droazen or @jonn-smith: checking back: any chance someone would be able to approve the workflow so tests can run?

@jonn-smith
Collaborator

@bbimber - Sorry for the delay. Approved and running.

@bbimber
Contributor Author

bbimber commented Jun 26, 2023

@jonn-smith: thanks. looks like the tests passed. what do you think about the PR and level of testing?

@bbimber bbimber requested a review from droazen June 30, 2023 19:27
@bbimber
Contributor Author

bbimber commented Jul 5, 2023

@droazen and/or @jonn-smith: is there anything else you need on this PR? It appears the test is passing

Contributor

@droazen droazen left a comment

@bbimber Back to you with a quick request

Object[][] arr1 = provideDataForTestCreateFuncotations(false);
Object[][] arr2 = provideDataForTestCreateFuncotations(true);
return Stream.concat(Arrays.stream(arr1), Arrays.stream(arr2))
        .toArray(size -> (Object[][]) Array.newInstance(arr1.getClass().getComponentType(), arr1.length + arr2.length));
Contributor

Could you use the preexisting utility method Utils.concat(arr1, arr2, Object[][]::new) here?
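
As a rough illustration of why the suggested call is simpler, here is a stand-in with the same shape as the Utils.concat(arr1, arr2, Object[][]::new) call mentioned above. ArrayConcat is a hypothetical class written for this example; GATK's own Utils.concat may be implemented differently.

```java
import java.util.Arrays;
import java.util.function.IntFunction;
import java.util.stream.Stream;

/** Hypothetical stand-in mirroring the shape of the suggested Utils.concat call; not GATK code. */
final class ArrayConcat {
    private ArrayConcat() {}

    /**
     * Returns a new array holding the elements of a followed by the elements of b.
     * The constructor argument (for example Object[][]::new) supplies a correctly
     * typed result array, so no reflection or unchecked cast is needed.
     */
    static <T> T[] concat(final T[] a, final T[] b, final IntFunction<T[]> constructor) {
        return Stream.concat(Arrays.stream(a), Arrays.stream(b)).toArray(constructor);
    }
}
```

A data provider could then combine both parameter sets with ArrayConcat.concat(arr1, arr2, Object[][]::new), replacing the Array.newInstance call and the explicit cast.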

Contributor Author

ok, done

@droazen droazen self-assigned this Jul 7, 2023
@bbimber
Contributor Author

bbimber commented Jul 7, 2023

@droazen: the change is added. it looks like this needs approval to kick off tests?

Contributor

@droazen droazen left a comment

👍 Tests pass, merging this one

@droazen droazen merged commit 810326c into broadinstitute:master Jul 10, 2023
@bbimber
Contributor Author

bbimber commented Jul 10, 2023

@droazen Thanks!

@bbimber bbimber deleted the locatableXSV branch July 10, 2023 18:49