-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for CBOR sequence file format #194
Conversation
3c06b20
to
e58afb8
Compare
After merging this it should be easy to also support the regular CBOR format(#171) using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @qiu-x
Can you please follow the cbor check algorithm more closely?
Also, consider changing functions like
// cborHelper mutates raw to point to the next cbor object
func cborHelper(raw *[]byte) bool
to
// cborHelper returns a new slice or raw containing the next cbor object
func cborHelper(raw []byte) (nextCbor []byte, ok bool)
With this change there won't be a need to dereference raw
at every step.
I removed the pointer and added a offset counter. @gabriel-vasile please let me know if I should change anything else. |
OK, after coming back to my code a few days later, everything still seems to look good now. If this gets merged I also have a pull request ready for regular CBOR support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, it looks better but there is a problem.
mimetype
has this readLimit that prevents loading huge files into memory. Because of this, CBOR sequences bigger than 3072 bytes will not be detected.
We had this problem for CSV and NDJSON formats. The way it is solved for these formats is the last line of input is ignored because it could be truncated at readLimit
index.
I think you can take the same approach here too. Ignore the last CBOR in sequence because it can be truncated.
References for NDJSON and CSV:
mimetype/internal/magic/text.go
Line 230 in 7cdf684
func NdJSON(raw []byte, limit uint32) bool { |
mimetype/internal/magic/text_csv.go
Line 10 in 7cdf684
func Csv(raw []byte, limit uint32) bool { |
test case for large file NDJSON:
Line 126 in 7cdf684
"ndjson.xl.ndjson": "application/x-ndjson", |
Please let me know if something is unclear or if I can help you somehow.
I changed the code to ignore the last CBOR if the limit is reached. I also made the test larger to include this case. |
023d656
to
b148aa7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code runs now for bigger sequences, but I'm not sure I can follow the algorithm.
I see mt
is compared to 0x40, 0x60, 0x80, 0xa0, 0xc0 at line 217. The specification uses 2, 3, 4, 5, 6, 7 for that check. Any particular reason for not using the same exact values they use in the specification?
internal/magic/binary.go
Outdated
return 0, false | ||
} | ||
|
||
mt := uint8(raw[offset] & 0xe0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the specification mt is:
mt = ib >> 5;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used 0x40, 0x60, 0x80...
because those are the raw byte values that CBOR uses. A bitwise AND with 0xe0
(11100000
) has the same effect as making 5 right shifts (>> 5
) - it discards the 5 last bits of the byte. Testing was a easier like this, but I will change it to 2, 3, 4...
to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, >> 5
is not the same as & 0xe0
. https://play.golang.org/p/NFgj0z0Mt9Q
} | ||
val = int(BgEn.Uint64(raw[offset : offset+8])) | ||
offset += 8 | ||
case 31: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In specification:
case 31:
return well_formed_indefinite(mt, breakable);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am returning false
here because for those values well_formed_indefinite
would call fail()
. But in my code cborIndefinite
is directly running cborHelper
to avoid duplication - this is possible because the mt
switch in well_formed_indefinite
and well_formed
is almost the same. Because of this I exclude the cases that cborIndefinite
should not handle.
internal/magic/binary.go
Outdated
} | ||
|
||
switch mt { | ||
case 0x40, 0x60: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In specification:
switch (mt) {
// case 0, 1, 7 do not have content; just use val
case 2: case 3: take(val); break; // bytes/UTF-8
case 4: for (i = 0; i < val; i++) well_formed(); break;
case 5: for (i = 0; i < val*2; i++) well_formed(); break;
case 6: well_formed(); break; // 1 embedded data item
case 7: if (ai == 24 && val < 32) fail(); // bad simple
}
return offset, true | ||
} | ||
|
||
func cborIndefinite(raw []byte, mt uint8, offset int) (int, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this function is the equivalent of well_formed_indefinite
from specification, but it does not look like it does the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be equivalent to the while
loops in well_formed_indefinite
, but it handles all the cases, since I already excluded the bad ones earlier in the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I'm insisting on following the specification exactly is because minor changes to the algorithm can introduce hard to find bugs. The specification is guaranteed to be correct and bug-free.
For example:
magic.CborSeq([]byte("\xf80"), readLimit)
triggers a panic. This would not happen if CborSeq
followed the specification exactly.
This adds support for CBOR sequence detection.
Official spec for reference:
https://www.rfc-editor.org/rfc/rfc8742.html
(Related to #172)