Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CBOR sequence file format #194

Closed
wants to merge 8 commits into from

Conversation

qiu-x
Copy link

@qiu-x qiu-x commented Oct 17, 2021

This adds support for CBOR sequence detection.
Official spec for reference:
https://www.rfc-editor.org/rfc/rfc8742.html
(Related to #172)

@qiu-x qiu-x force-pushed the master branch 2 times, most recently from 3c06b20 to e58afb8 Compare October 17, 2021 18:25
@qiu-x
Copy link
Author

qiu-x commented Oct 19, 2021

After merging this it should be easy to also support the regular CBOR format(#171) using the cborHelper function.

Copy link
Owner

@gabriel-vasile gabriel-vasile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @qiu-x
Can you please follow the cbor check algorithm more closely?

Also, consider changing functions like

// cborHelper mutates raw to point to the next cbor object
func cborHelper(raw *[]byte) bool

to

// cborHelper returns a new slice or raw containing the next cbor object
func cborHelper(raw []byte) (nextCbor []byte, ok bool)

With this change there won't be a need to dereference raw at every step.

internal/magic/binary.go Outdated Show resolved Hide resolved
internal/magic/binary.go Outdated Show resolved Hide resolved
@qiu-x
Copy link
Author

qiu-x commented Oct 25, 2021

I removed the pointer and added a offset counter. @gabriel-vasile please let me know if I should change anything else.

@qiu-x
Copy link
Author

qiu-x commented Oct 29, 2021

OK, after coming back to my code a few days later, everything still seems to look good now. If this gets merged I also have a pull request ready for regular CBOR support.

Copy link
Owner

@gabriel-vasile gabriel-vasile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, it looks better but there is a problem.
mimetype has this readLimit that prevents loading huge files into memory. Because of this, CBOR sequences bigger than 3072 bytes will not be detected.

We had this problem for CSV and NDJSON formats. The way it is solved for these formats is the last line of input is ignored because it could be truncated at readLimit index.
I think you can take the same approach here too. Ignore the last CBOR in sequence because it can be truncated.
References for NDJSON and CSV:

func NdJSON(raw []byte, limit uint32) bool {

func Csv(raw []byte, limit uint32) bool {

test case for large file NDJSON:

"ndjson.xl.ndjson": "application/x-ndjson",

Please let me know if something is unclear or if I can help you somehow.

@qiu-x
Copy link
Author

qiu-x commented Nov 2, 2021

I changed the code to ignore the last CBOR if the limit is reached. I also made the test larger to include this case.

Copy link
Owner

@gabriel-vasile gabriel-vasile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code runs now for bigger sequences, but I'm not sure I can follow the algorithm.
I see mt is compared to 0x40, 0x60, 0x80, 0xa0, 0xc0 at line 217. The specification uses 2, 3, 4, 5, 6, 7 for that check. Any particular reason for not using the same exact values they use in the specification?

return 0, false
}

mt := uint8(raw[offset] & 0xe0)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the specification mt is:

mt = ib >> 5;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used 0x40, 0x60, 0x80... because those are the raw byte values that CBOR uses. A bitwise AND with 0xe0 (11100000) has the same effect as making 5 right shifts (>> 5) - it discards the 5 last bits of the byte. Testing was a easier like this, but I will change it to 2, 3, 4... to avoid confusion.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, >> 5 is not the same as & 0xe0. https://play.golang.org/p/NFgj0z0Mt9Q

}
val = int(BgEn.Uint64(raw[offset : offset+8]))
offset += 8
case 31:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In specification:

    case 31:
      return well_formed_indefinite(mt, breakable);

Copy link
Author

@qiu-x qiu-x Nov 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am returning false here because for those values well_formed_indefinite would call fail(). But in my code cborIndefinite is directly running cborHelper to avoid duplication - this is possible because the mt switch in well_formed_indefinite and well_formed is almost the same. Because of this I exclude the cases that cborIndefinite should not handle.

}

switch mt {
case 0x40, 0x60:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In specification:

switch (mt) {
    // case 0, 1, 7 do not have content; just use val
    case 2: case 3: take(val); break; // bytes/UTF-8
    case 4: for (i = 0; i < val; i++) well_formed(); break;
    case 5: for (i = 0; i < val*2; i++) well_formed(); break;
    case 6: well_formed(); break;     // 1 embedded data item
    case 7: if (ai == 24 && val < 32) fail(); // bad simple
  }

return offset, true
}

func cborIndefinite(raw []byte, mt uint8, offset int) (int, bool) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this function is the equivalent of well_formed_indefinite from specification, but it does not look like it does the same thing.

Copy link
Author

@qiu-x qiu-x Nov 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be equivalent to the while loops in well_formed_indefinite, but it handles all the cases, since I already excluded the bad ones earlier in the code

Copy link
Owner

@gabriel-vasile gabriel-vasile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I'm insisting on following the specification exactly is because minor changes to the algorithm can introduce hard to find bugs. The specification is guaranteed to be correct and bug-free.

For example:

	magic.CborSeq([]byte("\xf80"), readLimit)

triggers a panic. This would not happen if CborSeq followed the specification exactly.

@qiu-x qiu-x closed this Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants