Skip to content
This repository has been archived by the owner on Nov 19, 2024. It is now read-only.

Fix for Identify() failing on empty and small files: #319

Merged
merged 3 commits into from
Mar 14, 2022

Conversation

congop
Copy link
Contributor

@congop congop commented Jan 29, 2022

Hi,

I had the following issue while identifying by content:

  • Identify("",bytes.NewReader([]byte{})) returns a fmt.wrapError of io.EOF
  • Identify("",bytes.NewReader([]byte{'a'})) returns a fmt.wrapError of io.ErrUnexpectedEOF

Version:

Expected Outcome:

  • is archiver.ErrNoMatch (i.e. not a compressed stream nor an archive)

This commit provides a fix, by properly handling io.EOF and io.ErrUnexpected. Please check archiver.head().

I have also try to rationalize the usage of io.ReadFull() vs. io.Reader.Read(). Hopefully this match your original rationale or is one that you can agree on.

Best,

Patrice

- Issue
  - Identify(,bytes.NewReader([]byte{})) returns an fmt.wrapError  of io.EOF
  - Identify(,bytes.NewReader([]byte{'a'})) returns an fmt.wrapError  of io.ErrUnexpectedEOF
  - the expected outcome is archiver.ErrNoMatch (i.e. not a compressed stream nor an archive)
- Cause: lack of handling of io.EOF and io.ErrUnexpectedEOF outcomes of io.ReadFull()
- Fix
  - consists in handling io.EOF and io.ErrUnexpectedEOF as cases of the stream not containing enough bytes
  - and returning  the available bytes up to the requested size
  - @see archiver.head()
@mholt
Copy link
Owner

mholt commented Jan 29, 2022

Hey, thanks, this looks very good, I appreciate the tests too. I will look at this more soon when I have a chance!

Copy link
Owner

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so first off, love the test cases. Thanks for writing those!

And you're right, this is a bug in each formats' Match() method. I didn't think about the case where the stream is empty or short, and indeed, any error is currently returned, but we should probably ignore EOF or UnexpectedEOF.

I am not sure we need a separate function in its own file (head()), for logic that's as simple as this:

if err != nil && !(err == io.ErrUnexpectedEOF || err == io.EOF) {
	return nil, err
}

How about instead, we simply change the offending if err != nil lines to:

if err != nil && !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
	return err
}

(using errors.Is and distributing the ! is a theoretically robust and maybe easier to read)

Doing this should also simplify the now-awkward rar handling of this.

What do you think? Basically, just change the incorrect if statements instead of making a new function. (I do like that you used :n though when reading the buf slice; that's probably a good thing to keep too, even though I think ReadFull will always return n == len(buf) if there's no error.)

@congop
Copy link
Contributor Author

congop commented Jan 31, 2022

Hi Matt,

I am glad you loved them. I even added more to make some corner cases explicit.

** On extending <<if err != nil>> and keeping using io.ReadFull()
This will not always work. The reason is that io.ReadFull() does not return n=len(buf) in the case of EOF and UnexpectedEOF.
The only reason Identity() do not panic because of slice bounds out of range is because buf was pre-created with the right size.
Bytes beyond index n-1 are not set by ReadFull(). By returning n ReadFull()[actually tells us not to use them. We should honor that.
It may also lead to false positive because the default byte value happen to be equal to the suffix a known header.
Please checkout those corner cases in TestIdentifyDoesNotMatchContentFromTrimmedKnownHeaderHaving0Suffix.
That was the hard reason.
The soft one is about making one's intent explicit. The intent here is us being ok with streams having less than the requested number of bytes.
Hence the choice of the name <<head()>>. The same non trivial feature is also repeated 8 times. So having an extra function seems to be the better practice.

** On using errors.Is()
errors.Is() does more than checking mere equality. It is not need because io.ReadFull does intentfully returns EOF and UnexpectedEOF.
Therefore we do not need to care about error causal hierarchies.

** On distributing <!>
<<!(err == io.ErrUnexpectedEOF || err == io.EOF)>> should be read <<!errMeansLessThanMaxbytesAvailable>>
So distributing <<!>> meant splitting things that meaningfully belongs together.
I am not quite sure you got that intent but still want the distribution of <<!>> because of readability?
Here are alternative to you to choose from:

  • distribute <<!>> without introducing errors.Is()
    if err != nil && err != io.ErrUnexpectedEOF && err != io.EOF {
    return nil, err
    }
  • Keep it as is
  • Going "full clean code" and being more explicit about intent
    errMeansLessThanMaxBytesAvailable:=err == io.ErrUnexpectedEOF || err == io.EOF
    if err != nil && !errMeansLessThanMaxBytesAvailable {
    return nil, err
    }
    I originally refrained from the third alternative because It felt too much clean-code-isch.
    Your remark though is may be a hint that it is the better way.

** On << now-awkward rar handling >>
I am afraid this is complexity we must live with because we are combining two checks.
I could make the intent more clear by using better naming and an extra variable:

  • basically we want head() result for v1.5

  • the code
    ...
    availLengthBufV1_5 := len(rarHeaderV1_5)
    if availLengthBufV1_5 > len(buf) {
    // because there may not be enough bytes in the stream
    availLengthBufV1_5 = len(buf)
    }
    headForV1_5 := buf[:availLengthBufV1_5]

    mr.ByStream = bytes.Equal(headForV1_5, rarHeaderV1_5) || bytes.Equal(buf, rarHeaderV5_0)

  • no extra function(headOfByte()) here because this is used ones and the containing method Rar.Macht() is still small.

As alternative we could use a comparison between the length of buf and rarHeadV1_5 in computing mr.ByStream.
2 issues with that:

  • it is being too clever where we can afford being dump (just get the bytes available).
  • It goes against the algorithm uniformity and cohesion (get stream head, and compare it with a known header).
    All that does not seem to be a better practice.

What do you think?

Patrice

@mholt
Copy link
Owner

mholt commented Feb 2, 2022

Thanks for the detailed comment, Patrice, really appreciate it.

I'm a bit swamped with a few other work things this week, plus am getting married, so I will circle back to this when I have a chance to give it the attention it deserves!

Copy link
Owner

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty of cleaning up some of the code for consistency with the existing code base (mainly comments and minor style adjustments). Otherwise I think this change is good. Thank you for contributing it and helping me understand it! I will go ahead and merge this.

@mholt mholt merged commit aa5354e into mholt:master Mar 14, 2022
@congop
Copy link
Contributor Author

congop commented Mar 16, 2022

Hi Matt,

That is great.

Welcome back.
Congratulations on your marriage.
My wish for you and your wive/husband: you at your diamond anniversary with a huge family explaining to your great-grandchildren: back then there were no ai, we had to code ourselves. And they: wow!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants