Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various spec violations #24

Open
jakobnissen opened this issue Jul 28, 2020 · 3 comments
Open

Various spec violations #24

jakobnissen opened this issue Jul 28, 2020 · 3 comments

Comments

@jakobnissen
Copy link
Member

Dear @CiaranOMara

I've been working a bit with XAM.jl. Great package, but I've found quite a number of small errors and mistakes I think should be fixed. Making one PR for each would be overwhelming, especially since fixing some of them has implications that are not straightforward. So I'd like to discuss it a bit here:

List of errors/inconsistencies

  • Missing qualities are set to 0xff, but hasquality does not check this, so the result may be wrong
  • refname should be accessible even if the read is unmapped (per specs section 1.4.3)
  • isnextmapped checks if the record is filled, but ismapped doesn't
  • hasauxdata doesn't even check if the record has aux data
  • quality(record) raises an obscure error if a quality is > 127
  • function MetaInfo should check that a tag has two characters per the specs
  • ismapped, isprimary and ispositive strand errors when called on an unfilled record, BUT isnextmapped doesn't
  • hasrefid, hasrefname and hasposition just returns if the record is filled, not what is actually asked. But hasrightposition behaves differently. Worse, refname behaves differently too, meaning you could have hasrefname() == true, yet it still errors when the refname is fetched.

Proposals/enhancements

  • We should just get rid of unfilled records, they make little sense as a concept. Instead, we initialize every record as an empty, but valid record. This way, we never need to check if the record is filled, since it always is.
  • Then we can remove e.g. hasflag since every record always have flags
  • all hasX function should return a Bool
  • if hasX function returns false, the accessor (X) function should error.
  • Rewrite the memory layout of the BAM record so it exactly matches the specifications. This will not result in any performance issues, just make the underlying code simpler and make it easier to interface with C in the future

I've implemented many of these changes at https://github.com/jakobnissen/XAM.jl, but that "fork" is stale, and I'd rather merge the fixes into the original XAM.jl

@CiaranOMara
Copy link
Member

CiaranOMara commented Jul 28, 2020

Making one PR for each would be overwhelming, especially since fixing some of them has implications that are not straightforward.

Let's split this up then.

Hotfix PRs addressing errors take priority. Hotfix PRs would be against the master branch and would embrace the current design flaws. Also, PRs addressing inconsistencies could be hotfixes too provided they don't alter the existing public API.

My preference is to address all known errors before making breaking changes (v0.3).

We should just get rid of unfilled records, they make little sense as a concept. Instead, we initialise every record as an empty, but valid record. This way, we never need to check if the record is filled, since it always is.

It'd be great to get rid of that isfilled check. The balance in question here is whether to use try catch, some other truthy toggle, or a use pattern that ensures the record is valid or that there was a conscious decision to use the record regardless.

The direction I was thinking is a two-stage approach, which would facilitate error recovery (https://github.com/BioJulia/XAM.jl/projects/4#card-42156212). The first stage would fill the record by copying in the current line. Then the second stage would attempt to index the copied data then return a truthy value.
With this approach, the file buffers and pointers will remain intact to process the next line if encountering an error or inconsistency while indexing the copied data. The state machine will greatly simplify as it would separate into three machines, the header-line, record-line and line structure.

I'm not sure what the performance implication would be with the two-stage approach as it parses the line twice? I'd be keen to hear your ideas about error recovery.

Rewrite the memory layout of the BAM record so it exactly matches the specifications.

I agree with rewriting the memory layout -- that's been bugging me. I also agree with your proposed way of consolidating the accessor and has functions. I think we should use BioGenerics.Exceptions.MissingFieldException.

PRs with developmental or breaking changes would be against the develop branch.

@jakobnissen
Copy link
Member Author

My preference is to address all known errors before making breaking changes (v0.3).

Right, makes sense. I'll begin with that

The balance in question here is whether to use try catch, some other truthy toggle, or a use pattern that ensures the record is valid or that there was a conscious decision to use the record regardless.

I think we should enforce specification-compliant BAM/SAM files, in the sense that, if the user provides a noncompliant file, we ought to error. Silent errors are the worst, especially in a scientific setting. I'm not sure there's a lot of sense in allowing bad records to be used.

In particular, I have rather strong opinions about #23 - this is an error in HiSat2, and should be fixed there. If we begin correcting errors created by bad writers, we have a neverending problem on our hands.

Nonetheless, having the internal implementation be first decompressing the data, then copying a single record to a Vector{UInt8} buffer, then using that buffer to construct a valid BAM Record, is a good idea. I think that's also how it works right now, isn't it? If a user wants to use an invalid record (again, which I don't think is something we should design for allowing), they may add a new method that creates a Record from the vector. The performance will be fine.

I'm quite ambivalent about how to handle missing data. I think there are two good options:

  1. Return nothing
  2. Raise a BioGenerics.Exceptions.MissingFieldException.
    I think returning a missing value is NOT a good idea, because missing is largely useless when processing SAM/BAM files, and may lead to unexpected errors. Probably a MissingFieldException is the way to go.

@CiaranOMara
Copy link
Member

I think we should enforce specification-compliant BAM/SAM files, in the sense that, if the user provides a noncompliant file, we ought to error.

I mostly agree. I think as a default we should expect specification-compliant BAM/SAM files and error when there is non-compliance. But, I think we only need to enforce this expectation at the line-level, not holistically, so that when it comes to iteration, users can make a judgement call and opt to skip errors.

Silent errors are the worst, especially in a scientific setting.

I agree we should error.

I'm not sure there's a lot of sense in allowing bad records to be used.

Yes, a poor choice of words on my part. The record would probably be unusable as well as indicative of other issues relating to the quality of the file.

What I was thinking is that the data should be accessible for diagnosis and easy to acquire. For example, after an error, we should be able to seek position, then read erroneous data into Record or MetaInfo.

In particular, I have rather strong opinions about #23 - this is an error in HiSat2, and should be fixed there.

I agree with that too and have no intention of merging the workaround. For me, the issue was that XAM in its current state, was not able to work with or around an error that was judged trivial.

Nonetheless, having the internal implementation be first decompressing the data, then copying a single record to a Vector{UInt8} buffer, then using that buffer to construct a valid BAM Record, is a good idea. I think that's also how it works right now, isn't it?

In the current release, and certainly with SAM, indexing of the data occurs as the line/record is determined. In a sense, the record already has the indexes before the raw data gets copied into it. Also, we do not explicitly check successful indexing; we assume indexing was successful if the record fills without error. I'd need to look at BAM again to see what it does with records, but the parsing of the BAM header is the same as SAM.

In the context of skipping errors, was proposing to have both index-then-fill and fill-then-index schemes. Though perhaps a cleaner approach would be to have Automa seek the end of the line/record if it falls into an invalid state and then returns a flag for error handling whether that be a throw or skip. With this approach, the pointer appropriately positions for the next line/record, and there is only a single pass over the data.

I'm quite ambivalent about how to handle missing data.

I'd forgotten about Missing. When we last spoke about it, I thought that missing could provide an additional statistic as well as neaten the accessors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants