Improve MongoDB query sanitization #819

luismiramirez · 2022-02-07T15:19:47Z

This commit changes how the MongoDB queries from the Mongo Ruby Driver
are sanitized.

Previously, the sanitizer filtered almost the whole query, not allowing
users to know which attributes and embedded documents were involved.

Also, the arrays were collapsed, hiding all the elements except the
first one. This can be problematic for Mongo queries: documents don't
have a closed schema, so the sanitization could hide some attributes.

This change makes the Mongo query sanitization more similar to what we
do with SQL queries using sql_lexer.

Before:

{
  "$db": "?",
  "documents": "[?]",
  "insert": "posts",
  "lsid": "?",
  "ordered": true
}

After:

{
  "$db": "?",
  "documents": [
    {
      "_id": "?",
      "authors": [
        {
          "_id": "?",
          "name": "?"
        },
        {
          "_id": "?",
          "name": "?",
          "surname": "?"
        }
      ],
      "body": "?",
      "title": "?"
    }
  ],
  "insert": "posts",
  "lsid": "?",
  "ordered": true
}
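
As a rough illustration of the "After" shape, here is a minimal sketch of a recursive sanitizer that keeps keys and nesting but masks every leaf value. The helper name `sanitize_sketch` and the sample command are hypothetical; this is not the gem's actual implementation, and unlike the real output above it also masks values such as the collection name.

```ruby
# Hypothetical helper for illustration only: masks every leaf value with "?"
# while preserving hash keys and the nesting of arrays and embedded documents.
def sanitize_sketch(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, child), result|
      result[key] = sanitize_sketch(child)
    end
  when Array
    value.map { |child| sanitize_sketch(child) }
  else
    "?"
  end
end

# Made-up insert payload with two differently shaped embedded documents.
command = {
  "documents" => [
    {
      "_id" => 1,
      "title" => "Hello",
      "authors" => [
        { "_id" => 10, "name" => "Ann" },
        { "_id" => 11, "name" => "Bob", "surname" => "Smith" }
      ]
    }
  ]
}

sanitized = sanitize_sketch(command)
puts sanitized.inspect
```

The second author keeps its extra `surname` key, which is the kind of attribute the old first-element-only collapsing would have hidden.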

Fixes: #769

.changesets/improve-mongo-ruby-driver-sanitization.md

tombruijn · 2022-02-07T15:27:25Z

lib/appsignal/utils/query_params_sanitizer.rb

@@ -39,7 +39,7 @@ def sanitize_array(array, only_top_level, key_sanitizer)
          else
            array.map do |value|
              sanitize(value, only_top_level, key_sanitizer)
-            end.uniq
+            end


I'd highlight this change in the commit message: previously, this hid how many documents were inserted/updated/etc. if they all had the same attributes.
It's a small change that's easily overlooked.

It's also part of the query sanitizer, which is also used by the Elasticsearch integration. I don't think it's a problem, because removing duplicate entries may hide how complex the query really was. "Did one object get inserted? Or 100,000 objects with the same structure?"

Can you add a test case for this as well? Two duplicate sanitized array entries should both appear in the sanitized result.
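
To make the effect of the removed `.uniq` concrete, here is a small sketch using plain Ruby arrays. The sample documents are made up for illustration and the sanitizer itself is not involved.

```ruby
# Three already-sanitized documents: two share a shape, one has an extra key.
sanitized_documents = [
  { "_id" => "?", "title" => "?" },
  { "_id" => "?", "title" => "?" },
  { "_id" => "?", "title" => "?", "body" => "?" }
]

# Array#uniq compares hashes by content, so identical shapes collapse into one.
collapsed = sanitized_documents.uniq

puts sanitized_documents.length # 3 entries without uniq
puts collapsed.length           # 2 entries with uniq
```

With `.uniq`, the result shows which document shapes were involved but not how many documents of each shape were sent.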

matsimitsu · 2022-02-07

This breaks some things down the line, though. Our processor/agent currently assumes a query is the same if a hash of the body matches.

Adding dynamic data (such as the number of documents inserted) breaks this convention and will lead to a huge increase in stored "unique" events in the database, where each permutation of the resulting array results in a new "unique" event being stored.

Ideally, the body should be the same regardless of whether one or 1000 documents were inserted/updated, or we have to update the hashing feature to take this into account and uniq it there to ensure the resulting hash is the same for 1/2/1000 documents inserted/updated/removed.
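
A minimal sketch of the concern, assuming the body is hashed from its serialized form. The `body_digest` and `deduplicated` helpers and the choice of SHA-256 over JSON are assumptions for illustration; the actual processor/agent hashing is not shown in this thread.

```ruby
require "digest"
require "json"

# Hypothetical digest: hashes the JSON serialization of a sanitized body.
def body_digest(body)
  Digest::SHA256.hexdigest(JSON.generate(body))
end

# Hypothetical normalization: collapse identical sanitized documents
# before hashing so the document count does not affect the digest.
def deduplicated(body)
  body.merge("documents" => body["documents"].uniq)
end

document = { "_id" => "?", "title" => "?" }
one  = { "insert" => "posts", "documents" => [document] }
many = { "insert" => "posts", "documents" => [document] * 3 }

# Without deduplication, each document count yields a different digest.
puts body_digest(one) == body_digest(many)

# With deduplication, one and three documents hash to the same value.
puts body_digest(deduplicated(one)) == body_digest(deduplicated(many))
```

Either the sanitizer or the hashing step can do the deduplication; the sketch only shows that it has to happen somewhere before the digest is computed if the digest is meant to be count-independent.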

tombruijn · 2022-02-07

@matsimitsu the query is not the same if a different number of documents is inserted. A query with many more documents can take a bit longer than a query that inserts just one document, right? Now it hides the different queries that are performed and squashes them into one.

This is the same as how it works for SQL queries. A sanitized INSERT query with multiple rows:

INSERT INTO `table` (`field1`, `field2`, `field3`, `field4`) VALUES (?, ?, ?, ?),(?, ?, ?, ?),(?, ?, ?, ?);

Or are you only concerned about the nested attributes like “authors”?

We were also considering storing the event as a String rather than a Data object, and collapsing the many occurrences of the inserted documents. That wouldn't change the uniqueness, but it would decrease the body size of the event.

The MongoDB string version for example:

{
  "$db": "?",
  "documents": [
    { // 10 times <--- These comments
      "_id": "?",
      "authors": [
        {
          "_id": "?",
          "name": "?"
        }
      ],
      "body": "?",
      "title": "?"
    }
    { // 5 times <--- These comments
      "_id": "?",
      "authors": [
        {
          "_id": "?",
          "name": "?"
        },
        {
          "_id": "?",
          "name": "?",
          "surname": "?"
        }
      ],
      "body": "?",
      "title": "?"
    }
  ],
  "insert": "posts",
  "lsid": "?",
  "ordered": true
}
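
One way the "N times" annotation could be derived is by counting identical sanitized documents. This is only a sketch: the `collapse_with_counts` helper and its output format are hypothetical and not something the gem provides.

```ruby
# Hypothetical helper: groups identical sanitized documents and records
# how often each shape occurred, preserving first-seen order.
def collapse_with_counts(documents)
  counts = Hash.new(0)
  documents.each { |document| counts[document] += 1 }
  counts.map { |document, times| { "times" => times, "document" => document } }
end

single_author = { "_id" => "?", "authors" => [{ "_id" => "?", "name" => "?" }] }
two_authors = {
  "_id" => "?",
  "authors" => [
    { "_id" => "?", "name" => "?" },
    { "_id" => "?", "name" => "?", "surname" => "?" }
  ]
}

documents = Array.new(10) { single_author } + Array.new(5) { two_authors }
summary = collapse_with_counts(documents)

summary.each do |entry|
  puts "#{entry["times"]} times: #{entry["document"].keys.join(", ")}"
end
```

Note that including the counts in the stored body would again make the body vary with the number of documents.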

We can also test this out on staging and see what the effect would be.

matsimitsu · 2022-02-08

> @matsimitsu the query is not the same if a different number of documents are inserted.

Ahh, ok for Mongo/Elastic it was the same, regardless of the number of documents inserted. All of this digesting was because we couldn't get the line number of the query in a performant way, back when we implemented this.

The idea was that the event (in this case the query) is the same, regardless of arguments, e.g. the performance is about the query on line 12 of your Model, not about a combination of the query on line 12 of the model, and the arguments (1/2/3/10 documents).

The reason is that for an app with a wildly varying number of documents inserted, we would generate tons of events. This makes it more difficult for a customer to track the performance of an event, since it's now 10 events, depending on the number of documents inserted.

This also means the event now differs per sample, so in order to find the event with 10 documents inserted, you'd need to browse a ton of samples to find the right event, instead of having one event for one query at the same place in the event tree.

This is a departure from the current implementation and expectations and may need a bit more documentation than just a changelog line.

luismiramirez · 2022-02-09

Re-added the uniq to reduce the number of events.

matsimitsu

See nested comment: https://github.com/appsignal/appsignal-ruby/pull/819/files#r800801365

tombruijn · 2022-02-10T09:24:34Z

After this is merged, let's release an alpha, then install it in a real MongoDB app and see what the results are.

backlog-helper · 2022-02-14T08:02:50Z

While performing the daily checks some issues were found with this Pull Request.

This Pull Request needs more reviews. @jeffkreeftmeijer @unflxw

tombruijn · 2022-02-14T12:26:51Z

Test is running, moving to waiting.

luismiramirez added enhancement support labels Feb 7, 2022

luismiramirez requested review from jeffkreeftmeijer, tombruijn and unflxw February 7, 2022 15:19

luismiramirez self-assigned this Feb 7, 2022

tombruijn requested changes Feb 7, 2022

matsimitsu requested changes Feb 7, 2022

tombruijn mentioned this pull request Feb 9, 2022

Check query param sanitizer for arrays flattening the value #820

Closed

2 tasks

luismiramirez force-pushed the mongo-sanitization branch from cebec49 to 6b8f95c February 9, 2022 11:04

tombruijn self-requested a review February 10, 2022 08:02

tombruijn approved these changes Feb 10, 2022

tombruijn requested a review from matsimitsu February 10, 2022 08:11

luismiramirez force-pushed the mongo-sanitization branch from 6b8f95c to f19d9dc February 10, 2022 08:20

matsimitsu approved these changes Feb 11, 2022

Publish packages

02e7caf

- v3.0.21.alpha.1 [ci skip]

jeffkreeftmeijer approved these changes Feb 14, 2022

luismiramirez merged commit b9fef97 into main Feb 15, 2022

luismiramirez deleted the mongo-sanitization branch February 15, 2022 13:58

luismiramirez mentioned this pull request Feb 16, 2022

Less restrictive Redis sanitization #822

Open

2 tasks
