-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: Improve bcf Record filter interface and improve docs #306
Conversation
Pull Request Test Coverage Report for Build 1003750613
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work.
And I really like the idea of providing the filter names
, not the id
s. Making this library more user-friendly is definitely important!
One possible downside to these changes:
If you repeatedly apply the same filter function on lots of records in a loop, would repeatedly calling name_to_id()
result in a much performance hit compared to looking the id
s up once before starting the loop? Or do we expect cargo to clean this up for us during compilation? Maybe a small performance test could give an indication?
src/bcf/record.rs
Outdated
/// record.remove_filter(b"bar", true).unwrap(); | ||
/// assert!(!record.has_filter(b"bar")); | ||
/// assert!(record.has_filter(b"PASS")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add another example with pass_on_empty=false
?
src/bcf/record.rs
Outdated
/// # | ||
/// # // Write uncompressed VCF to stdout with above header and get an empty record | ||
/// # let mut vcf = Writer::from_stdout(&header, true, Format::VCF).unwrap(); | ||
/// # let mut record = vcf.empty_record(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make this line visible in the docs? So that it is clear that record
is an empty_record()
to start with.
/// # let mut record = vcf.empty_record(); | |
/// let mut record = vcf.empty_record(); |
src/bcf/record.rs
Outdated
/// # Errors | ||
/// **Attention:** the returned [`BufferBacked`] from [`integer()`](Format::integer) | ||
/// (`read_depths`), which holds the data, has to be kept in scope as long as the data is | ||
/// accessed. If parts of the data are accessed after the `BufferBacked` object is been |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo:
/// accessed. If parts of the data are accessed after the `BufferBacked` object is been | |
/// accessed. If parts of the data are accessed after the `BufferBacked` object has been |
src/bcf/record.rs
Outdated
/// | ||
/// # Errors | ||
/// **Attention:** the returned [`BufferBacked`] from [`integer()`](Format::integer) | ||
/// (`read_depths`), which holds the data, has to be kept in scope as long as the data is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does has to be kept in scope
mean for the user? Do I have to manually make sure, that it is in scope? Or will the lifetime of the resulting buffer do this for me and have cargo complain if this is not ensured?
src/bcf/record.rs
Outdated
assert!(!record.has_filter(b"foo")); | ||
assert!(record.has_filter(b"bar")); | ||
assert!(record.remove_filter(b"baz", true).is_err()); | ||
record.remove_filter(b"bar", true).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I'd also add the false
case.
src/bcf/record.rs
Outdated
/// - `flt_ids` - The identifiers of the filter values to set. | ||
pub fn set_filters(&mut self, flt_ids: &[Id]) { | ||
let mut flt_ids: Vec<i32> = flt_ids.iter().map(|x| **x as i32).collect(); | ||
pub fn set_filters(&mut self, flt_ids: &[&[u8]]) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename flt_ids
to something like flt_names
(as in name_to_id()
)? Although ID refers to the respective field where the filter name is given in the header, it also refers to the number assigned to that filter name internally. So maybe flt_names
is less ambiguous?
This is a very good point and I feel ashamed I didn't think of this. Benchmark of the implementation in this PR
Benchmark with previous implementation
So there is definitely a hit to performance for this convenience.
A link to the benchmark code can be found at https://github.com/mbhall88/rust-htslib/blob/3be94eb3a55cbd800d055e7fd733e47d1974f95d/benches/bcf.rs I guess this is not surprising, as we are adding an extra bit of retrieval to the code. htslib's name2id method basically just looks the name up in a hash table, so retrieval is pretty quick - hence the low drop in "real" performance. However, it is a drop nonetheless. There are a few of ways forward from here (that I see):
|
Thanks for so thoroughly setting up a benchmark! On the one hand, the time lost on 1 million records is not much, and BCFs will usually be limited to the millions of records (or maybe billions, but not more), simply by the size of most genomes. On the other hand, the relative increase in runtime seems to be to about 8x the original runtime, and that does seem like a bigger hit, if you handle lots of big BCFs. I like your list of options and with the above considerations, I would opt for 3---including doctests that explain the performance difference, reference each other across the two different trait implementations and recommend one time conversion of |
Ok, so I implemented a trait, Benchmark against original implementationThis benchmark used
So the trait has an overhead (on Benchmark with
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have thought that the trait implementation shouldn't cause any overhead (e.g. see here). So maybe the devil is in the detail (of the implementation) somewhere?
@@ -79,6 +79,30 @@ impl NumericUtils for i32 { | |||
} | |||
} | |||
|
|||
/// A trait to allow for seamless use of bytes or integer identifiers for filters | |||
pub trait FilterId { | |||
fn id_from_header(&self, header: &HeaderView) -> Result<Id>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As none of the function implementations can return an error, I don't think wrapping this in a Result
is needed:
fn id_from_header(&self, header: &HeaderView) -> Result<Id>; | |
fn id_from_header(&self, header: &HeaderView) -> Id; |
Maybe this is what is causing the overhead for Id
? Because we only need the unpacking after using name_to_id()
, so maybe move the unpacking in to the implementation for [u8]
?
Ok(*self) | ||
} | ||
fn is_pass(&self) -> bool { | ||
*self == Id(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous code, this was simply checking *self == 0
(*flt_id == 0
) without the Id()
wrapping. Is this definitely needed, or could this be contributing to overhead, as well?
I'm off on holidays for three weeks so won't be able to dig into this for a while. Please feel free to try out some things on this PR branch if you like. |
Hmmm, I can't really find a way to remove the performance cost. I guess I'll just leave this PR here in case someone with better rust skills than I can find a way - or someone with more time to dedicate to it. |
To me the difference sounds quite marginal. Given the improved ergonomics, I think we can just keep it as it is and merge? Is there still anything to do here before merging? |
It's all good to go |
This is possibly a controversial change. I find having to interact with
Id
whenever trying to deal with bcf Record filters really counterintuitive and convoluted.For example, if I want to check whether a record has some filter
q10
I have to do something likeand this gets worse if I want to push a collection of filters as I have to get their internal Id from the header for each. This seems to be exposing internal things that needn't be exposed IMO.
The changes I am suggesting here instead allow you to do things like
which seems much more ergonomic?
I've also added thorough documentation for all touched methods - plus for the
format
method, which had no docs.Happhy to discuss and refine these proposed changes.