Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve MongoDB query sanitization #819

Merged
merged 2 commits into from
Feb 15, 2022
Merged

Improve MongoDB query sanitization #819

merged 2 commits into from
Feb 15, 2022

Conversation

luismiramirez
Copy link
Member

This commit changes how the MongoDB queries from the Mongo Ruby Driver
are sanitized.

Previously, the sanitizer filtered almost the whole query, not allowing
users to know which attributes and embedded documents were involved.

Also, the arrays were collapsed, hiding all the elements except the
first one. This might be problematic in Mongo queries as documents don't
have a closed schema so the sanitization could hide some attributes.

This change makes the Mongo query sanitization more similar to what we
do with SQL queries using sql_lexer.

Before:

{
  "$db": "?",
  "documents": "[?]",
  "insert": "posts",
  "lsid": "?",
  "ordered": true
}

After:

{
  "$db": "?",
  "documents": [
    {
      "_id": "?",
      "authors": [
        {
          "_id": "?",
          "name": "?"
        },
        {
          "_id": "?",
          "name": "?",
          "surname": "?"
        }
      ],
      "body": "?",
      "title": "?"
    }
  ],
  "insert": "posts",
  "lsid": "?",
  "ordered": true
}

Fixes: #769

.changesets/improve-mongo-ruby-driver-sanitization.md Outdated Show resolved Hide resolved
@@ -39,7 +39,7 @@ def sanitize_array(array, only_top_level, key_sanitizer)
else
array.map do |value|
sanitize(value, only_top_level, key_sanitizer)
end.uniq
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd highlight this change in the commit message, that this previously hid how many documents were inserted/updated/etc if they all had the same attributes.
It's a small change that's easily overlooked.

It's also part of the query sanitizer, which is also used by the ElasticSearch integration. I don't think it's a problem, because removing duplicate entries may hide how complex the query really was. "Did one object get inserted? Or 100.000 objects with the same structure?"

Can you add a test case for this as well? Two duplicate sanitized array entries should both appear in the sanitized result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks some things down the line, though. Our processor/agent now assumes a query is the same if a hash of the body matches.

Adding dynamic data (such as the number of documents inserted) breaks this convention and will lead to a huge increase in stored "unique" events in the database, where each permutation of the resulting array results in a new "unique" event being stored.

Ideally, the body should be the same regardless of whether one or 1000 documents were inserted/updated, or we have to update the hashing feature to take this into account and uniq it there to ensure the resulting hash is the same for 1/2/1000 documents inserted/updated/removed.

Copy link
Member

@tombruijn tombruijn Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matsimitsu the query is not the same if a different number of documents are inserted. A query with many more documents can take a bit longer than a query that inserts just one document, right? Now it hides the different queries that are performed and squashes them into one.

This is the same of how it works for SQL queries. Sanitized INSERT query with multiple rows:

INSERT INTO `table` (`field1`, `field2`, `field3`, `field4`) VALUES (?, ?, ?, ?),(?, ?, ?, ?),(?, ?, ?, ?);

Or are you only concerned about the nested attributes like “authors”?

We were also considering storing the event as a String rather than a Data object. And removing the many occurrences of the inserted documents. That wouldn't change the uniqueness, but it would decrease the body size of the event.

The MongoDB string version for example:

{
  "$db": "?",
  "documents": [
    { // 10 times    <--- These comments
      "_id": "?",
      "authors": [
        {
          "_id": "?",
          "name": "?"
        }
      ],
      "body": "?",
      "title": "?"
    }
    { // 5 times    <--- These comments
      "_id": "?",
      "authors": [
        {
          "_id": "?",
          "name": "?"
        },
        {
          "_id": "?",
          "name": "?",
          "surname": "?"
        }
      ],
      "body": "?",
      "title": "?"
    }
  ],
  "insert": "posts",
  "lsid": "?",
  "ordered": true
}

We can also test this out on staging and seeing what the effect would be.

Copy link
Member

@matsimitsu matsimitsu Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matsimitsu the query is not the same if a different number of documents are inserted.

Ahh, ok for Mongo/Elastic it was the same, regardless of the number of documents inserted. All of this digesting was because we couldn't get the line number of the query in a performant way, back when we implemented this.

The idea was that the event (in this case the query) is the same, regardless of arguments, e.g. the performance is about the query on line 12 of your Model, not about a combination of the query on line 12 of the model, and the arguments (1/2/3/10 documents).

The reason is that for an app with a wildly varying number of documents inserted, we generate tons of events, this makes it more difficult for a customer to track the performance of an event, since it's now 10 events, depending on documents inserted.

This also means the event now differs per sample, so in order to find the event with 10 documents inserted, you'd need to browse a ton of samples to find the right event, instead of having one event for one query at the same place in the event tree.

This is a departure from the current implementation and expectations and may need a bit more documentation than just a changelog line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readded the uniq to reduce the number of events.

Copy link
Member

@matsimitsu matsimitsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@backlog-helper

This comment has been minimized.

@tombruijn tombruijn self-requested a review February 10, 2022 08:02
@tombruijn tombruijn requested a review from matsimitsu February 10, 2022 08:11
This commit changes how the MongoDB queries from the Mongo Ruby Driver
are sanitized.

Previously, the sanitizer filtered almost the whole query, not allowing
users to know which attributes and embedded documents were involved.

Also, the arrays were collapsed, hiding all the elements except the
first one. This might be problematic in Mongo queries as documents don't
have a closed schema so the sanitization could hide some attributes.

This change makes the Mongo query sanitization more similar to what we
do with SQL queries using sql_lexer.

Before:

```js
{
  "$db": "?",
  "documents": "[?]",
  "insert": "posts",
  "lsid": "?",
  "ordered": true
}
```

After:

```js
{
  "$db": "?",
  "documents": [
    {
      "_id": "?",
      "authors": [
        {
          "_id": "?",
          "name": "?"
        },
        {
          "_id": "?",
          "name": "?",
          "surname": "?"
        }
      ],
      "body": "?",
      "title": "?"
    }
  ],
  "insert": "posts",
  "lsid": "?",
  "ordered": true
}
```
@tombruijn
Copy link
Member

After this is merged, let's release an alpha, then install it in real mongodb app and see what the results are.

@backlog-helper

This comment has been minimized.

@backlog-helper
Copy link

While performing the daily checks some issues were found with this Pull Request.


New issue guide | Backlog management | Rules | Feedback

- v3.0.21.alpha.1

[ci skip]
@tombruijn
Copy link
Member

Test is running, moving to waiting.

@luismiramirez luismiramirez merged commit b9fef97 into main Feb 15, 2022
@luismiramirez luismiramirez deleted the mongo-sanitization branch February 15, 2022 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Query Sanitization
4 participants