Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the OverlapDetector to be generic over input feature types #34

Merged
merged 18 commits into from
Aug 2, 2024

Conversation

clintval
Copy link
Member

@clintval clintval commented Jun 27, 2024

I'm interested in seeing what others think given that the OverlapDetector isn't an ergonomic utility to use, especially for custom interval-types which I may often choose to implement myself, or have received from a third-party package/API.

Goals

  1. The ergonomics of the OverlapDetector could be improved to store and query generic intervals.
  2. This PR should introduce the fewest backwards incompatibilities

Example with Existing API

The following demonstrates using a custom interval type with the OverlapDetector:

@dataclass(eq=True, frozen=True)
class CustomPrimer:
    """A custom primer feature in the shape of a BED3+1 record."""

    refname: str
    start: int
    end: int
    primary_key: str

panel: list[CustomPrimer] = [
    CustomPrimer("chr1", 1, 4, "primer1"),
    CustomPrimer("chr1", 5, 9, "primer2"),
]

detector = OverlapDetector()
for primer in panel:
    detector.add(Interval(primer.chrom, primer.start, primer.end, name=primer.primary_key))

primer_lookup: dict[str, CustomPrimer] = {primer.primary_key: primer for primer in panel}
overlapping_intervals: list[Interval] = detector.get_overlaps(Interval("chr1", 2, 3))
overlapping_primers: list[CustomPrimer] = [primer_lookup[interval.name] for interval in overlapping_intervals]

What makes this challenging to use is:

  1. The OverlapDetector only supports storing Interval objects
  2. The OverlapDetector can only be queried with Interval objects
  3. This library's primary data structure BedRecord doesn't work with OverlapDetector
  4. In order to convert back to your data of interest, you must knowingly store a primary key in the name field of Interval and perform a dictionary lookup (as shown above). If you don't have a primary key, you will need to make one.
  5. Any intervals must be added separately from class instantiation, often adding 2 LOC.

Example with this PR

@dataclass(eq=True, frozen=True)
class CustomPrimer:
    """A custom primer feature in the shape of a BED3+1 record."""

    refname: str
    start: int
    end: int
    primary_key: str

panel: list[CustomPrimer] = [
    CustomPrimer("chr1", 1, 4, "primer1"),
    CustomPrimer("chr1", 5, 9, "primer2"),
]

detector: OverlapDetector[CustomPrimer] = OverlapDetector(panel)
overlapping_primers: list[CustomPrimer] = detector.get_overlaps(Interval("chr1", 2, 3))

NB: although we are using an Interval to query the OverlapDetector, the type system is aware that we built the OverlapDetector with CustomPrimer types and all overlap functions will always return CustomPrimer objects. A user could alternatively query the overlap detector with any other interval-like feature but still receive CustomerPrimer objects which overlap.

NB: the BED spec notation BED3+1 is described here

@clintval clintval added the contentious-stuff 🔥🔥🔥 label Jun 27, 2024
@clintval clintval requested review from msto, nh13 and tfenne June 27, 2024 06:02
pybedlite/overlap_detector.py Show resolved Hide resolved
pybedlite/overlap_detector.py Outdated Show resolved Hide resolved
pybedlite/overlap_detector.py Show resolved Hide resolved
Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry about performance of this PR. See #6 for an example of how we may want to test this.

pybedlite/overlap_detector.py Outdated Show resolved Hide resolved
@clintval
Copy link
Member Author

I worry about performance of this PR. See #6 for an example of how we may want to test this.

I'll find a way to test that! I don't think there will be any performance issues using the OverlapDetector but it is possible building one will be a little slower since you need to check each object you add for a reference name property.

@nh13
Copy link
Member

nh13 commented Jun 27, 2024

@clintval that's the case I'm worried about. I wonder if we can avoid it somehow.

@tfenne
Copy link
Member

tfenne commented Jun 27, 2024

@clintval I skimmed the PR briefly and wanted to suggest an alternative to see you have considered it and discarded it. The OverlapDetector is named for the same construct from HTSJDK ... and that version does things slightly differently. Beyond the fact that it can take any implementation of Locatable as the "Interval" type, it also decouples the intervals/spans from the returned element.

At it's simplest you can think of it as:

overlap_detector.add(interval, primer)
primers = overlap_detector.get_overlaps(interval)

I.e. for every interval you associate the object you'd like returned when a query overlaps that interval. It fixes some, but not all, of your issues I think - primarily that you don't need to keep an external mapping of interval back to your source type.

@clintval
Copy link
Member Author

@clintval I skimmed the PR briefly and wanted to suggest an alternative to see you have considered it and discarded it. The OverlapDetector is named for the same construct from HTSJDK ... and that version does things slightly differently. Beyond the fact that it can take any implementation of Locatable as the "Interval" type, it also decouples the intervals/spans from the returned element.

At it's simplest you can think of it as:

overlap_detector.add(interval, primer)
primers = overlap_detector.get_overlaps(interval)

I.e. for every interval you associate the object you'd like returned when a query overlaps that interval. It fixes some, but not all, of your issues I think - primarily that you don't need to keep an external mapping of interval back to your source type.

@tfenne I thought about that too but didn't give it a lot of thought because for large interval collections it seems it could ~2x your memory footprint by having a sibling object for every object you actually want to query. Does that seem right?

@tfenne
Copy link
Member

tfenne commented Jun 28, 2024

I think that's right @clintval - more memory usage for that solution vs. likely more CPU for having to indirect function calls.

@nh13
Copy link
Member

nh13 commented Jul 13, 2024

@clintval another solution is to sub-class Interval and add an attribute that you'd like to carry with that interval. This would not require an "external map", as well as not bloat memory for folks that just want to store intervals. I have successfully used this pattern elsewhere.

@clintval
Copy link
Member Author

@clintval another solution is to sub-class Interval and add an attribute that you'd like to carry with that interval. This would not require an "external map", as well as not bloat memory for folks that just want to store intervals. I have successfully used this pattern elsewhere.

I like it! That is a good idea for how to use the existing classes and functions without the need for creating a dictionary. It would work well for custom genomic features. It would still require making an interval companion object for every genomic feature and wouldn't work for 3rd-party genomic features (where I cannot define the class hierarchy), but it's a little more elegant than what I've been doing.

pybedlite/overlap_detector.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Aug 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.25%. Comparing base (592fb8e) to head (eeda85e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #34      +/-   ##
==========================================
+ Coverage   91.84%   95.25%   +3.40%     
==========================================
  Files           8        8              
  Lines         552      674     +122     
  Branches       97      119      +22     
==========================================
+ Hits          507      642     +135     
+ Misses         28       18      -10     
+ Partials       17       14       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@clintval clintval merged commit b449126 into main Aug 2, 2024
6 checks passed
@clintval clintval deleted the cv_generic_overlap_detector branch August 2, 2024 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants