VCF_CSI_index_implementation #1684

Open
wants to merge 1 commit into
base: master

@gokalpcelik gokalpcelik commented Sep 19, 2023

VCF-CSI index reading functionality for TabixReader.class

Unlike htslib and bcftools, HTSJDK cannot read CSI-format indexes for VCF files. This PR modifies HTSJDK so that it, and downstream tools, can accept this index format, which is needed for contigs larger than 2^29-1. The existing CSIIndex class in HTSJDK was implemented for the BAM-style CSI index, which differs from the VCF-style CSI index. htslib already contains all the necessary handling in tbx.c.

TabixReader.class already contains the code needed to read chunks from TBI index files. Compared with TBI, however, a CSI index requires the byte-reading steps to be reordered and the linear index to be disabled. The regions-to-bins (reg2bins) method also needs a new version to accommodate larger contig sizes and bin values. All changes were made in the original TabixReader.class so that no new index code has to be wired back into the VCFReader and AbstractFeatureReader classes.
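
For reviewers unfamiliar with the binning scheme, here is a minimal sketch of a parameterised reg2bins, transcribed from the reference C function in the CSIv1 specification. It is not the code in this PR: the class and method names (`CsiBinningSketch`, `maxSpan`, `binCount`) are hypothetical, and it assumes a small `depth` so that the int arithmetic does not overflow. With the TBI parameters (min_shift = 14, depth = 5) it reproduces the familiar fixed scheme, whose addressable span is 2^29.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names are hypothetical and not part of HTSJDK.
final class CsiBinningSketch {

    /** Largest coordinate span the scheme can address: 2^(minShift + 3 * depth). */
    static long maxSpan(int minShift, int depth) {
        return 1L << (minShift + 3 * depth);
    }

    /** Total number of bins over all levels: (8^(depth + 1) - 1) / 7. */
    static int binCount(int depth) {
        return ((1 << (3 * (depth + 1))) - 1) / 7;
    }

    /**
     * Lists the bins that may overlap the zero-based, half-open region [beg, end).
     * Follows the reg2bins function given in the CSIv1 specification.
     */
    static List<Integer> reg2bins(long beg, long end, int minShift, int depth) {
        List<Integer> bins = new ArrayList<>();
        int shift = minShift + 3 * depth; // log2 of the bin width at the current level
        int offset = 0;                   // number of the first bin at the current level
        end--;                            // convert the half-open end to an inclusive one
        for (int level = 0; level <= depth; level++) {
            int first = offset + (int) (beg >> shift);
            int last = offset + (int) (end >> shift);
            for (int bin = first; bin <= last; bin++) {
                bins.add(bin);
            }
            offset += 1 << (3 * level);   // skip past the 8^level bins of this level
            shift -= 3;                   // each level down is 8 times finer
        }
        return bins;
    }
}
```

Calling `CsiBinningSketch.reg2bins(0, 1, 14, 5)` yields one bin per level, six in total; raising `depth` to 6 adds a level and extends the addressable span to 2^32, which is why the size of the `bins` array has to depend on `depth` rather than being a constant.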

Completed tasks:

  • byte[] CSI_MAGIC is added to distinguish a TBI index from a CSI index.
  • A new reg2bins method is implemented to reflect the changes in the CSI format, based on the CSIv1 documentation with a few modifications.
  • The min_shift, depth and loff_set variables were added to reflect the changes required for the CSI index format.
  • The size of int bins[] is updated to the size needed for the CSI index format.
  • An isCsiIndex boolean is added as a switch to reorder byte reading.
  • The linear index is disabled for CSI indexes.
  • VCFFileReader now reads a CSI-indexed Triticum VCF file and performs regional queries on it when created with the constructor VCFFileReader(file).
  • AbstractFeatureReader.class and TabixReader.class are modified to automatically derive a ".csi" index filename for vcf.gz files that do not have a ".tbi" index.
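
The magic check and the index-filename fallback described above could look roughly like the following. This is a standalone sketch, not the PR's code: `IndexKindSketch`, `classify` and `resolveIndexPath` are hypothetical names, the header bytes are assumed to have been BGZF-decompressed already, and the "prefer .tbi, otherwise .csi" order is an assumption about the intended behaviour.

```java
import java.io.File;
import java.util.Arrays;

// Illustrative only: names are hypothetical and not part of HTSJDK.
final class IndexKindSketch {

    enum Kind { TBI, CSI, UNKNOWN }

    // Both index formats start with a 4-byte magic string ending in the byte 0x01.
    static final byte[] TBI_MAGIC = {'T', 'B', 'I', 1};
    static final byte[] CSI_MAGIC = {'C', 'S', 'I', 1};

    /** Classifies an index from the first four bytes of its decompressed content. */
    static Kind classify(byte[] header) {
        if (header == null || header.length < 4) {
            return Kind.UNKNOWN;
        }
        byte[] magic = Arrays.copyOf(header, 4);
        if (Arrays.equals(magic, TBI_MAGIC)) {
            return Kind.TBI;
        }
        if (Arrays.equals(magic, CSI_MAGIC)) {
            return Kind.CSI;
        }
        return Kind.UNKNOWN;
    }

    /** Assumed fallback order: use "<data>.tbi" if it exists, otherwise "<data>.csi". */
    static String resolveIndexPath(String dataPath) {
        String tbi = dataPath + ".tbi";
        if (new File(tbi).exists()) {
            return tbi;
        }
        return dataPath + ".csi";
    }
}
```

A reader could then set a flag such as isCsiIndex from `classify(...) == Kind.CSI` and branch on it when parsing the rest of the header.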

Current To-do:

  • Documentation updates
  • Test updates
  • Comment updates
  • REVIEW!

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extend the README / documentation, if necessary.
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@LarsStegemanGT

Hello,
What is the status of this PR?
This feature would be very helpful to us.
Thanks!


2 participants