-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve MongoDB query sanitization #819
Conversation
@@ -39,7 +39,7 @@ def sanitize_array(array, only_top_level, key_sanitizer) | |||
else | |||
array.map do |value| | |||
sanitize(value, only_top_level, key_sanitizer) | |||
end.uniq | |||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd highlight this change in the commit message, that this previously hid how many documents were inserted/updated/etc if they all had the same attributes.
It's a small change that's easily overlooked.
It's also part of the query sanitizer, which is also used by the ElasticSearch integration. I don't think it's a problem, because removing duplicate entries may hide how complex the query really was. "Did one object get inserted? Or 100.000 objects with the same structure?"
Can you add a test case for this as well? Two duplicate sanitized array entries should both appear in the sanitized result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks some things down the line, though. Our processor/agent now assumes a query is the same if a hash of the body matches.
Adding dynamic data (such as the number of documents inserted) breaks this convention and will lead to a huge increase in stored "unique" events in the database, where each permutation of the resulting array results in a new "unique" event being stored.
Ideally, the body should be the same regardless of whether one or 1000 documents were inserted/updated, or we have to update the hashing feature to take this into account and uniq it there to ensure the resulting hash is the same for 1/2/1000 documents inserted/updated/removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matsimitsu the query is not the same if a different number of documents are inserted. A query with many more documents can take a bit longer than a query that inserts just one document, right? Now it hides the different queries that are performed and squashes them into one.
This is the same of how it works for SQL queries. Sanitized INSERT query with multiple rows:
INSERT INTO `table` (`field1`, `field2`, `field3`, `field4`) VALUES (?, ?, ?, ?),(?, ?, ?, ?),(?, ?, ?, ?);
Or are you only concerned about the nested attributes like “authors”?
We were also considering storing the event as a String rather than a Data object. And removing the many occurrences of the inserted documents. That wouldn't change the uniqueness, but it would decrease the body size of the event.
The MongoDB string version for example:
{
"$db": "?",
"documents": [
{ // 10 times <--- These comments
"_id": "?",
"authors": [
{
"_id": "?",
"name": "?"
}
],
"body": "?",
"title": "?"
}
{ // 5 times <--- These comments
"_id": "?",
"authors": [
{
"_id": "?",
"name": "?"
},
{
"_id": "?",
"name": "?",
"surname": "?"
}
],
"body": "?",
"title": "?"
}
],
"insert": "posts",
"lsid": "?",
"ordered": true
}
We can also test this out on staging and seeing what the effect would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matsimitsu the query is not the same if a different number of documents are inserted.
Ahh, ok for Mongo/Elastic it was the same, regardless of the number of documents inserted. All of this digesting was because we couldn't get the line number of the query in a performant way, back when we implemented this.
The idea was that the event (in this case the query) is the same, regardless of arguments, e.g. the performance is about the query on line 12 of your Model, not about a combination of the query on line 12 of the model, and the arguments (1/2/3/10 documents).
The reason is that for an app with a wildly varying number of documents inserted, we generate tons of events, this makes it more difficult for a customer to track the performance of an event, since it's now 10 events, depending on documents inserted.
This also means the event now differs per sample, so in order to find the event with 10 documents inserted, you'd need to browse a ton of samples to find the right event, instead of having one event for one query at the same place in the event tree.
This is a departure from the current implementation and expectations and may need a bit more documentation than just a changelog line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Readded the uniq
to reduce the number of events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cebec49
to
6b8f95c
Compare
This comment has been minimized.
This comment has been minimized.
This commit changes how the MongoDB queries from the Mongo Ruby Driver are sanitized. Previously, the sanitizer filtered almost the whole query, not allowing users to know which attributes and embedded documents were involved. Also, the arrays were collapsed, hiding all the elements except the first one. This might be problematic in Mongo queries as documents don't have a closed schema so the sanitization could hide some attributes. This change makes the Mongo query sanitization more similar to what we do with SQL queries using sql_lexer. Before: ```js { "$db": "?", "documents": "[?]", "insert": "posts", "lsid": "?", "ordered": true } ``` After: ```js { "$db": "?", "documents": [ { "_id": "?", "authors": [ { "_id": "?", "name": "?" }, { "_id": "?", "name": "?", "surname": "?" } ], "body": "?", "title": "?" } ], "insert": "posts", "lsid": "?", "ordered": true } ```
6b8f95c
to
f19d9dc
Compare
After this is merged, let's release an alpha, then install it in real mongodb app and see what the results are. |
This comment has been minimized.
This comment has been minimized.
While performing the daily checks some issues were found with this Pull Request.
|
- v3.0.21.alpha.1 [ci skip]
Test is running, moving to waiting. |
This commit changes how the MongoDB queries from the Mongo Ruby Driver
are sanitized.
Previously, the sanitizer filtered almost the whole query, not allowing
users to know which attributes and embedded documents were involved.
Also, the arrays were collapsed, hiding all the elements except the
first one. This might be problematic in Mongo queries as documents don't
have a closed schema so the sanitization could hide some attributes.
This change makes the Mongo query sanitization more similar to what we
do with SQL queries using sql_lexer.
Before:
After:
Fixes: #769