Add support for CBOR sequence file format #194

qiu-x · 2021-10-17T16:52:15Z

This adds support for CBOR sequence detection.
Official spec for reference:
https://www.rfc-editor.org/rfc/rfc8742.html
(Related to #172)

qiu-x · 2021-10-19T15:28:14Z

After merging this, it should be easy to also support the regular CBOR format (#171) using the cborHelper function.

gabriel-vasile

Hi, @qiu-x
Can you please follow the cbor check algorithm more closely?

Also, consider changing functions like

// cborHelper mutates raw to point to the next cbor object
func cborHelper(raw *[]byte) bool

to

// cborHelper returns a new slice of raw containing the next cbor object
func cborHelper(raw []byte) (nextCbor []byte, ok bool)

With this change there won't be a need to dereference raw at every step.
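One way to read the suggested signature is sketched below. The names `skipItem` and `countItems` are hypothetical and not part of the PR, the parser understands only two kinds of item, and it returns the bytes that follow the item rather than the item itself. The point is only that the caller reslices instead of dereferencing a pointer.

```go
package main

import "fmt"

// skipItem is a hypothetical stand-in for cborHelper. It understands only
// small unsigned integers (0x00..0x17) and short byte strings (0x40..0x57,
// length in the low 5 bits). It returns the bytes that follow the item, so
// callers never have to dereference a *[]byte.
func skipItem(raw []byte) (rest []byte, ok bool) {
	if len(raw) == 0 {
		return nil, false
	}
	ib := raw[0]
	switch {
	case ib <= 0x17: // unsigned integer stored in the initial byte
		return raw[1:], true
	case ib >= 0x40 && ib <= 0x57: // byte string of length 0..23
		n := int(ib & 0x1f)
		if len(raw) < 1+n {
			return nil, false
		}
		return raw[1+n:], true
	}
	return nil, false
}

// countItems walks the input by repeatedly reslicing raw.
func countItems(raw []byte) (int, bool) {
	n := 0
	for len(raw) > 0 {
		rest, ok := skipItem(raw)
		if !ok {
			return n, false
		}
		raw = rest
		n++
	}
	return n, true
}

func main() {
	fmt.Println(countItems([]byte{0x01, 0x42, 0xaa, 0xbb, 0x17}))
}
```

Reslicing does not copy the underlying array, so this shape costs nothing compared with the pointer version.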

internal/magic/binary.go


qiu-x · 2021-10-25T15:51:42Z

I removed the pointer and added an offset counter. @gabriel-vasile please let me know if I should change anything else.

qiu-x · 2021-10-29T21:16:04Z

OK, after coming back to my code a few days later, everything still looks good. If this gets merged, I also have a pull request ready for regular CBOR support.

gabriel-vasile

Hi, it looks better but there is a problem.
mimetype has this readLimit that prevents loading huge files into memory. Because of this, CBOR sequences bigger than 3072 bytes will not be detected.

We had this problem for the CSV and NDJSON formats. For these formats, the last line of input is ignored because it may be truncated at the readLimit index.
I think you can take the same approach here too: ignore the last CBOR item in the sequence because it may be truncated.
References for NDJSON and CSV:

mimetype/internal/magic/text.go

Line 230 in 7cdf684

func NdJSON(raw []byte, limit uint32) bool {

mimetype/internal/magic/text_csv.go

Line 10 in 7cdf684

func Csv(raw []byte, limit uint32) bool {

test case for large file NDJSON:

mimetype/mimetype_test.go

Line 126 in 7cdf684

"ndjson.xl.ndjson": "application/x-ndjson",
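The same idea applied to a sequence of binary items might look like the sketch below. Everything in it is hypothetical: `skipItem` is a toy parser for two item kinds, `looksLikeSeq` is not the PR's `CborSeq`, and treating `limit == 0` as "no limit" is an assumption. A short final item is forgiven only when the input fills the whole limit and at least one complete item was seen.

```go
package main

import "fmt"

type status int

const (
	complete  status = iota // a whole item was consumed
	truncated               // the input ended in the middle of an item
	malformed               // the bytes cannot start a valid item
)

// skipItem is a hypothetical toy parser: small unsigned integers
// (0x00..0x17) and short byte strings (0x40..0x57) only.
func skipItem(raw []byte) ([]byte, status) {
	if len(raw) == 0 {
		return nil, truncated
	}
	ib := raw[0]
	switch {
	case ib <= 0x17:
		return raw[1:], complete
	case ib >= 0x40 && ib <= 0x57:
		n := int(ib & 0x1f)
		if len(raw) < 1+n {
			return nil, truncated
		}
		return raw[1+n:], complete
	}
	return nil, malformed
}

// looksLikeSeq accepts a sequence whose last item is cut short, but only
// when raw is as long as the read limit (assumption: 0 means no limit).
func looksLikeSeq(raw []byte, limit uint32) bool {
	hitLimit := limit > 0 && uint32(len(raw)) >= limit
	items := 0
	for len(raw) > 0 {
		rest, st := skipItem(raw)
		if st == truncated {
			return hitLimit && items > 0
		}
		if st == malformed {
			return false
		}
		raw, items = rest, items+1
	}
	return items > 0
}

func main() {
	data := []byte{0x01, 0x43, 0xaa} // one integer, then a 3-byte string cut short
	fmt.Println(looksLikeSeq(data, 3), looksLikeSeq(data, 3072))
}
```

Distinguishing "truncated" from "malformed" keeps the leniency narrow: a bad byte in the middle of the input is still rejected even when the limit was reached.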

Please let me know if something is unclear or if I can help you somehow.

qiu-x · 2021-11-02T21:09:16Z

I changed the code to ignore the last CBOR item if the limit is reached. I also made the test larger to include this case.

gabriel-vasile

Code runs now for bigger sequences, but I'm not sure I can follow the algorithm.
I see mt is compared to 0x40, 0x60, 0x80, 0xa0, 0xc0 at line 217. The specification uses 2, 3, 4, 5, 6, 7 for that check. Any particular reason for not using the same exact values they use in the specification?

gabriel-vasile · 2021-11-09T14:57:02Z

internal/magic/binary.go

+ return 0, false
+ }
+
+ mt := uint8(raw[offset] & 0xe0)


In the specification mt is:

mt = ib >> 5;

qiu-x · 2021-11-12

I used 0x40, 0x60, 0x80... because those are the raw byte values that CBOR uses. A bitwise AND with 0xe0 (11100000) has the same effect as making 5 right shifts (>> 5) - it discards the 5 last bits of the byte. Testing was easier like this, but I will change it to 2, 3, 4... to avoid confusion.

gabriel-vasile · 2021-11-13

No, >> 5 is not the same as & 0xe0. https://play.golang.org/p/NFgj0z0Mt9Q
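The difference can be reproduced locally with a small sketch (the helper names `majorTypeShift` and `majorTypeMask` are made up for illustration). Both operations discard the low five bits, but the shift moves the remaining three bits down to the range 0..7, while the mask leaves them in place, giving 0x00, 0x20, ..., 0xe0.

```go
package main

import "fmt"

// majorTypeShift is the form used in the specification: mt = ib >> 5.
func majorTypeShift(ib byte) byte { return ib >> 5 }

// majorTypeMask keeps the top three bits where they are.
func majorTypeMask(ib byte) byte { return ib & 0xe0 }

func main() {
	for _, ib := range []byte{0x40, 0x5f, 0x80, 0xf8} {
		fmt.Printf("ib=0x%02x  shift=%d  mask=0x%02x\n",
			ib, majorTypeShift(ib), majorTypeMask(ib))
	}
}
```

The two results identify the same major type, so a switch over either set of constants can be made correct, but the values are not interchangeable: shifting the masked value right by 5 yields the shifted one.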

gabriel-vasile · 2021-11-09T14:57:41Z

internal/magic/binary.go

+ }
+ val = int(BgEn.Uint64(raw[offset : offset+8]))
+ offset += 8
+ case 31:


In specification:

case 31: return well_formed_indefinite(mt, breakable);

qiu-x · 2021-11-12

I am returning false here because for those values well_formed_indefinite would call fail(). In my code, cborIndefinite calls cborHelper directly to avoid duplication. This is possible because the mt switches in well_formed_indefinite and well_formed are almost the same. Because of this, I exclude the cases that cborIndefinite should not handle.

gabriel-vasile · 2021-11-09T15:17:34Z

internal/magic/binary.go

+ }
+
+ switch mt {
+ case 0x40, 0x60:


In specification:

switch (mt) {
  // case 0, 1, 7 do not have content; just use val
  case 2: case 3: take(val); break; // bytes/UTF-8
  case 4: for (i = 0; i < val; i++) well_formed(); break;
  case 5: for (i = 0; i < val*2; i++) well_formed(); break;
  case 6: well_formed(); break;     // 1 embedded data item
  case 7: if (ai == 24 && val < 32) fail(); // bad simple
}

gabriel-vasile · 2021-11-09T15:22:53Z

internal/magic/binary.go

+ return offset, true
+}
+
+func cborIndefinite(raw []byte, mt uint8, offset int) (int, bool) {


I guess this function is the equivalent of well_formed_indefinite from specification, but it does not look like it does the same thing.

qiu-x · 2021-11-12

This should be equivalent to the while loops in well_formed_indefinite, but it handles all the cases, since I already excluded the bad ones earlier in the code.

gabriel-vasile

I'm insisting on following the specification exactly because minor changes to the algorithm can introduce hard-to-find bugs. The specification is guaranteed to be correct and bug-free.

For example:

	magic.CborSeq([]byte("\xf80"), readLimit)

triggers a panic. This would not happen if CborSeq followed the specification exactly.
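For comparison, here is a minimal sketch of a checker that stays close to the well_formed and well_formed_indefinite pseudocode while bounds-checking every read. It is not the PR's code: `reader`, `take`, `skip`, and `wellFormedSeq` are hypothetical names, and rejecting empty input is an assumption made for detection purposes.

```go
package main

import "fmt"

const brk, indefinite = -1, 99 // sentinel results, as in the pseudocode

// reader is a hypothetical cursor over the input; all names are illustrative.
type reader struct {
	buf []byte
	pos int
}

func (r *reader) take(n uint64) ([]byte, bool) {
	if n > uint64(len(r.buf)-r.pos) { // short input: fail, never slice out of range
		return nil, false
	}
	r.pos += int(n)
	return r.buf[r.pos-int(n) : r.pos], true
}

// skip checks n consecutive data items.
func (r *reader) skip(n uint64) bool {
	for i := uint64(0); i < n; i++ {
		if _, ok := r.wellFormed(false); !ok {
			return false
		}
	}
	return true
}

// wellFormed mirrors well_formed: finite items yield their major type.
func (r *reader) wellFormed(breakable bool) (int, bool) {
	b, ok := r.take(1)
	if !ok {
		return 0, false
	}
	mt, ai := int(b[0]>>5), b[0]&0x1f
	val := uint64(ai)
	switch {
	case ai >= 24 && ai <= 27: // argument follows in 1, 2, 4 or 8 bytes
		arg, ok := r.take(uint64(1) << (ai - 24))
		if !ok {
			return 0, false
		}
		val = 0
		for _, c := range arg { // big-endian decode
			val = val<<8 | uint64(c)
		}
	case ai >= 28 && ai <= 30: // reserved
		return 0, false
	case ai == 31:
		return r.wellFormedIndefinite(mt, breakable)
	}
	switch mt { // major types 0 and 1 carry no content beyond val
	case 2, 3: // string payload of val bytes
		_, ok = r.take(val)
	case 4: // array of val items
		ok = r.skip(val)
	case 5: // map of val pairs; two passes avoid overflowing val*2
		ok = r.skip(val) && r.skip(val)
	case 6: // a tag wraps exactly one item
		ok = r.skip(1)
	case 7: // a two-byte simple value must not encode 0..31
		ok = !(ai == 24 && val < 32)
	}
	return mt, ok
}

// wellFormedIndefinite mirrors well_formed_indefinite.
func (r *reader) wellFormedIndefinite(mt int, breakable bool) (int, bool) {
	switch mt {
	case 2, 3, 4, 5:
		for {
			it, ok := r.wellFormed(true)
			if !ok || it == brk {
				return indefinite, ok
			}
			// Chunks must be finite strings of type mt; map keys need a value.
			if (mt <= 3 && it != mt) || (mt == 5 && !r.skip(1)) {
				return 0, false
			}
		}
	case 7: // a break is valid only directly inside an indefinite item
		return brk, breakable
	}
	return 0, false // major types 0, 1 and 6 have no indefinite form
}

func wellFormedSeq(raw []byte) bool {
	r := &reader{buf: raw}
	for r.pos < len(raw) {
		if !r.skip(1) {
			return false
		}
	}
	return len(raw) > 0
}

func main() { fmt.Println(wellFormedSeq([]byte("\xf80"))) }
```

Because every read goes through `take`, short input such as a lone `0xf8` is reported as not well-formed instead of indexing past the end of the slice. The two bytes from the example above decode as simple value 48 and are accepted. Recursion depth grows with nesting, which stays bounded when the input is capped by a read limit.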

qiu-x force-pushed the master branch 2 times, most recently from 3c06b20 to e58afb8 on October 17, 2021 18:25

Add support for CBOR sequence file format

51031b8

qiu-x force-pushed the master branch from e58afb8 to 51031b8 on October 17, 2021 18:33

Change names to properly describe the implementation of CBOR sequences

27f3cbb

gabriel-vasile requested changes Oct 19, 2021

Change the implementation to follow the official checking algorithm more closely

6d3338b

A few clean ups in the CBOR sequence implementation.

c3a652f

qiu-x force-pushed the master branch from 34094af to c3a652f on October 29, 2021 21:06

qiu-x requested a review from gabriel-vasile October 31, 2021 07:55

gabriel-vasile reviewed Nov 1, 2021

qiu-x requested a review from gabriel-vasile November 2, 2021 21:09

qiu-x force-pushed the master branch 4 times, most recently from 023d656 to b148aa7 on November 5, 2021 05:24

qiu-x added 3 commits November 5, 2021 06:25

Take readLimit into account

784b8dc

Minor cleanup

573bf2f

Increase test size

20a5ca6

qiu-x force-pushed the master branch from b148aa7 to 20a5ca6 on November 5, 2021 05:26

gabriel-vasile reviewed Nov 9, 2021

Change cborHelper and cborIndefinite to use bit shifted values

64a3c81

qiu-x requested a review from gabriel-vasile November 12, 2021 09:56

gabriel-vasile requested changes Nov 13, 2021

qiu-x closed this Apr 12, 2024
