-
Notifications
You must be signed in to change notification settings - Fork 456
[CDRIVER-6017] Tweak visitor behavior when finding corrupt BSON #2019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CDRIVER-6017] Tweak visitor behavior when finding corrupt BSON #2019
Conversation
We now invoke `visit_corrupt` in more cases where `bson_visit_all` cannot properly decode elements, rather than just stopping with the boolean error code.
| * | ||
| * Passing this as a flag has no effect. | ||
| */ | ||
| BSON_VALIDATE_CORRUPT = (1 << 30), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reuse BSON_VALIDATE_UTF8 instead? It already "has no effect" as a behavior control flag and already describes the very UTF-8 validity checks which BSON_VALIDATE_CORRUPT is meant to describe (appears redundant).
s/BSON_VALIDATE_CORRUPT/BSON_VALIDATE_UTF8/g
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but VISIT_CORRUPT is now (as of this PR) also used for when a subdocument is invalid (previously this was also just return true;, meaning that bson_validate was also missing this case).
I considered adding a .visit_invalid_utf8 callback, which is potentially useful in general, but there's ambiguity in the semantics of "corruption", because the spec technically says that BSON strings are UTF-8, so a non-utf-8 string could arguably be considered corruption.
| while (_bson_iter_next_internal (iter, 0, &key, &bson_type, &unsupported)) { | ||
| if (*key && !bson_utf8_validate (key, strlen (key), false)) { | ||
| iter->err_off = iter->off; | ||
| VISIT_CORRUPT (iter, data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we take this opportunity to extend test suite coverage of the .visit_corrupt callback function to account for these new checks? atm there only seems to be a single test case. Perhaps the VALIDATE_TEST macro can be extended to also check .visit_corrupt callback behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change results in rejecting inserts that previously succeeded:
bson_t to_insert = BSON_INITIALIZER;
ASSERT (bson_append_utf8 (&to_insert, "v", 1, "a\xFF\b", 3));
bool ok = mongoc_collection_insert_one (coll, &to_insert, NULL, NULL, &error);
ASSERT_OR_PRINT (ok, error);
// Before PR : OK
// After PR : invalid document for insert: corrupt BSONWrite helpers validate UTF-8 by default. However, this validation does not appear documented, and appears to have always been broken.
I would rather remove the UTF-8 validation in write helpers. I expect that would be a performance improvement, and remove the (low?) risk of breaking applications inserting invalid UTF-8.
Notably: UTF-8 key validation appeared to work (and is tested in PHP mongodb/mongo-php-driver#1819 (comment)).
It seems like most drivers do not validate UTF-8 on write. See survey + slack discussion.
Proposal:
- Do not validate UTF-8 values (but validate keys) if
BSON_VALIDATE_UTF8is not set. - Remove
BSON_VALIDATE_UTF8from the default write helper flags (since UTF-8 validation never worked and the write helpers did not document expecting UTF-8 validation). - Do validate UTF-8 in
bson_validateifBSON_VALIDATE_UTF8is set (to match documented behavior ofbson_validate).
This is a minimal fix for some edge cases around BSON iteration/visitation and text validation. A larger change could be made to make the APIs more complete and easier to use, but this is enough to fix a bug related to UTF-8 validation.
We now invoke
visit_corruptin more cases wherebson_visit_allcannot properly decode elements, rather than just stopping with the boolean error code.This also adds an error code to
bson_validatefor encountering corruption. Previously, it would just seterror->code = 0and the message tocorrupt BSON, which is counter-intuitive to those that expecterror->code == 0to indicate success.See also: CDRIVER-4448