Fix for Identify() failing on empty and small files: #319

congop · 2022-01-29T16:05:34Z

Hi,

I had the following issue while identifying by content:

Identify("",bytes.NewReader([]byte{})) returns a fmt.wrapError of io.EOF
Identify("",bytes.NewReader([]byte{'a'})) returns a fmt.wrapError of io.ErrUnexpectedEOF

Version:

commit a44c8d2

Expected Outcome:

is archiver.ErrNoMatch (i.e. not a compressed stream nor an archive)

This commit provides a fix, by properly handling io.EOF and io.ErrUnexpected. Please check archiver.head().

I have also try to rationalize the usage of io.ReadFull() vs. io.Reader.Read(). Hopefully this match your original rationale or is one that you can agree on.

Best,

Patrice

@see

- Issue - Identify(,bytes.NewReader([]byte{})) returns an fmt.wrapError of io.EOF - Identify(,bytes.NewReader([]byte{'a'})) returns an fmt.wrapError of io.ErrUnexpectedEOF - the expected outcome is archiver.ErrNoMatch (i.e. not a compressed stream nor an archive) - Cause: lack of handling of io.EOF and io.ErrUnexpectedEOF outcomes of io.ReadFull() - Fix - consists in handling io.EOF and io.ErrUnexpectedEOF as cases of the stream not containing enough bytes - and returning the available bytes up to the requested size - @see archiver.head()

mholt · 2022-01-29T18:28:21Z

Hey, thanks, this looks very good, I appreciate the tests too. I will look at this more soon when I have a chance!

mholt

Ok, so first off, love the test cases. Thanks for writing those!

And you're right, this is a bug in each formats' Match() method. I didn't think about the case where the stream is empty or short, and indeed, any error is currently returned, but we should probably ignore EOF or UnexpectedEOF.

I am not sure we need a separate function in its own file (head()), for logic that's as simple as this:

if err != nil && !(err == io.ErrUnexpectedEOF || err == io.EOF) {
	return nil, err
}

How about instead, we simply change the offending if err != nil lines to:

if err != nil && !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
	return err
}

(using errors.Is and distributing the ! is a theoretically robust and maybe easier to read)

Doing this should also simplify the now-awkward rar handling of this.

What do you think? Basically, just change the incorrect if statements instead of making a new function. (I do like that you used :n though when reading the buf slice; that's probably a good thing to keep too, even though I think ReadFull will always return n == len(buf) if there's no error.)

…der ends with 0.

congop · 2022-01-31T10:00:12Z

Hi Matt,

I am glad you loved them. I even added more to make some corner cases explicit.

** On extending <<if err != nil>> and keeping using io.ReadFull()
This will not always work. The reason is that io.ReadFull() does not return n=len(buf) in the case of EOF and UnexpectedEOF.
The only reason Identity() do not panic because of slice bounds out of range is because buf was pre-created with the right size.
Bytes beyond index n-1 are not set by ReadFull(). By returning n ReadFull()[actually tells us not to use them. We should honor that.
It may also lead to false positive because the default byte value happen to be equal to the suffix a known header.
Please checkout those corner cases in TestIdentifyDoesNotMatchContentFromTrimmedKnownHeaderHaving0Suffix.
That was the hard reason.
The soft one is about making one's intent explicit. The intent here is us being ok with streams having less than the requested number of bytes.
Hence the choice of the name <<head()>>. The same non trivial feature is also repeated 8 times. So having an extra function seems to be the better practice.

** On using errors.Is()
errors.Is() does more than checking mere equality. It is not need because io.ReadFull does intentfully returns EOF and UnexpectedEOF.
Therefore we do not need to care about error causal hierarchies.

** On distributing <!>
<<!(err == io.ErrUnexpectedEOF || err == io.EOF)>> should be read <<!errMeansLessThanMaxbytesAvailable>>
So distributing <<!>> meant splitting things that meaningfully belongs together.
I am not quite sure you got that intent but still want the distribution of <<!>> because of readability?
Here are alternative to you to choose from:

distribute <<!>> without introducing errors.Is()
if err != nil && err != io.ErrUnexpectedEOF && err != io.EOF {
return nil, err
}
Keep it as is
Going "full clean code" and being more explicit about intent
errMeansLessThanMaxBytesAvailable:=err == io.ErrUnexpectedEOF || err == io.EOF
if err != nil && !errMeansLessThanMaxBytesAvailable {
return nil, err
}
I originally refrained from the third alternative because It felt too much clean-code-isch.
Your remark though is may be a hint that it is the better way.

** On << now-awkward rar handling >>
I am afraid this is complexity we must live with because we are combining two checks.
I could make the intent more clear by using better naming and an extra variable:

basically we want head() result for v1.5
the code
...
availLengthBufV1_5 := len(rarHeaderV1_5)
if availLengthBufV1_5 > len(buf) {
// because there may not be enough bytes in the stream
availLengthBufV1_5 = len(buf)
}
headForV1_5 := buf[:availLengthBufV1_5]

mr.ByStream = bytes.Equal(headForV1_5, rarHeaderV1_5) || bytes.Equal(buf, rarHeaderV5_0)
no extra function(headOfByte()) here because this is used ones and the containing method Rar.Macht() is still small.

As alternative we could use a comparison between the length of buf and rarHeadV1_5 in computing mr.ByStream.
2 issues with that:

it is being too clever where we can afford being dump (just get the bytes available).
It goes against the algorithm uniformity and cohesion (get stream head, and compare it with a known header).
All that does not seem to be a better practice.

What do you think?

Patrice

mholt · 2022-02-02T04:07:08Z

Thanks for the detailed comment, Patrice, really appreciate it.

I'm a bit swamped with a few other work things this week, plus am getting married, so I will circle back to this when I have a chance to give it the attention it deserves!

mholt

I took the liberty of cleaning up some of the code for consistency with the existing code base (mainly comments and minor style adjustments). Otherwise I think this change is good. Thank you for contributing it and helping me understand it! I will go ahead and merge this.

congop · 2022-03-16T09:42:58Z

Hi Matt,

That is great.

Welcome back.
Congratulations on your marriage.
My wish for you and your wive/husband: you at your diamond anniversary with a huge family explaining to your great-grandchildren: back then there were no ai, we had to code ourselves. And they: wow!

mholt suggested changes Jan 30, 2022

View reviewed changes

Add tests to guard against false posivives in cases where a known hea…

98f22cc

…der ends with 0.

Some cleanup

087da42

mholt approved these changes Mar 14, 2022

View reviewed changes

mholt merged commit aa5354e into mholt:master Mar 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for Identify() failing on empty and small files: #319

Fix for Identify() failing on empty and small files: #319

congop commented Jan 29, 2022

mholt commented Jan 29, 2022

mholt left a comment •

edited

Loading

congop commented Jan 31, 2022 •

edited

Loading

mholt commented Feb 2, 2022

mholt left a comment

congop commented Mar 16, 2022

Fix for Identify() failing on empty and small files: #319

Fix for Identify() failing on empty and small files: #319

Conversation

congop commented Jan 29, 2022

mholt commented Jan 29, 2022

mholt left a comment • edited Loading

Choose a reason for hiding this comment

congop commented Jan 31, 2022 • edited Loading

mholt commented Feb 2, 2022

mholt left a comment

Choose a reason for hiding this comment

congop commented Mar 16, 2022

mholt left a comment •

edited

Loading

congop commented Jan 31, 2022 •

edited

Loading