Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DB sanitization uniform format #717

Closed
avzis opened this issue Jan 26, 2023 · 8 comments · Fixed by #1100
Closed

DB sanitization uniform format #717

avzis opened this issue Jan 26, 2023 · 8 comments · Fixed by #1100
Assignees

Comments

@avzis
Copy link
Contributor

avzis commented Jan 26, 2023

What are you trying to achieve?

According to the db spec, the db.statement value can be sanitized, but it is not defined how to do so.

Currently, the sanitization is being dealt with differently in few places.
I suggest to add a uniform format that will describe how to do the sanitization.
(It will be best if this format will apply to all different DB's and syntaxes)

Different implementations examples:

  • JS mongo db - implements the sanitization by replacing the information with question marks
  • Python pymongo - implements the sanitization by deleting the information completely, and leaving the query method name only
  • Python elasticsearch (WIP) - suggests to replace the information with a string that will explain that the data is sanitized

I suggest a few options to replace the value with:

  • Keep the method name, and add a sanitized text.
    for example: db.statement = "SELECT {query information is sanitized}"
    advantages: quite easy to implement, easy to keep consistent across libraries.
    dis-advantages: will require to research whether all different libraries can handle this format effectively.

  • Simple text that will describe that the value is sanitized.
    for example: db.statement = "query information is sanitized"
    advantages: easy to implement, easy to keep consistent across different libraries.
    dis-advantages: doesn't supply basic information about the query that could be useful.

  • Replace the values with question marks.
    for example: db.statement = "SELECT ? FROM ?"
    advantages: keeps more amount of information, while still not exposing sensitive or private data.
    dis-advantages: harder to implement, harder to keep consistent across libraries.

I would like to hear opinions about the suggested solutions, or hear different ideas.

Additional context.

open-telemetry/opentelemetry-specification#3104 - Issue regarding changing the recommendation to sanitize the information by default.
#708 - Issue about missing examples for sanitization in specs.

@mateuszrzeszutek
Copy link
Member

FWIW Java implements the 3rd approach in all DB clients that we currently instrument (SQL and SQL-like, Redis, MongoDb) - enabled by default, disabled by a configuration setting. We attempt to sanitize all variables, while preserving the database schema if possible.

I think that the first two approaches do not provide any useful information (the first one simply contains db.operation, the second one adds no info at all) and do not differ from the "delete db.statement" option.

@blumamir
Copy link
Member

If the data is sanitzed, and we write some text like "SELECT ? FROM ?" in the attribute value, it could be hard for backends to correctly label this value and process it.

For example, if processors try to extract the table name from the statement, they might try to parse this text and decide the table name is a valid "?" instead of "missing"/"sanitized". If the value is "SELECT {query information is sanitized}" it will probably fail parsing the SQL statement which could be either invalid behavior of the app (requires attention), or a valid "magic text" to signal lack of data.

Another example - SET foo ? - is it setting the value for key foo to ? or was it sanitized? it's impossible to tell for sure. It's another mental step for users to do in order to understand what they are looking at.

I would love to see some deterministic rules that allow the processors to differentiate between sanitized statements and actual statements. If it involve parsing or looking for magic text, that should be precisely speced.
Do you think it make sense to set the db.statement attribute with an empty string "" to signal it was not collected intentionally?

@trask
Copy link
Member

trask commented Feb 7, 2024

@open-telemetry/technical-committee can you move this to https://github.com/open-telemetry/semantic-conventions?

@jack-berg jack-berg transferred this issue from open-telemetry/opentelemetry-specification Feb 7, 2024
@trask
Copy link
Member

trask commented Apr 24, 2024

possible language for SQL queries:

replace all literal values with ?

@trask
Copy link
Member

trask commented Apr 26, 2024

should normalization be part of this or not? e.g.

normalize in (?, ?, ?, ?, ...) to in (?)

or

normalize in (?, ?, ?, ?, ...) to in (_more10_50_) to give an idea of number of items in the inlist

@maryliag
Copy link
Contributor

Keep the method name, and add a sanitized text.
for example: db.statement = "SELECT {query information is sanitized}"
advantages: quite easy to implement, easy to keep consistent across libraries.
dis-advantages: will require to research whether all different libraries can handle this format effectively.

I wouldn't say is necessarily easy, because how do you decide the "main" method on a statement that have multiple methods such as update and select. Or statements that start with WITH, how the method is decided.

Replace the values with question marks.
for example: db.statement = "SELECT ? FROM ?"
advantages: keeps more amount of information, while still not exposing sensitive or private data.
dis-advantages: harder to implement, harder to keep consistent across libraries.

For this one I think is more helpful for the user but agree it will required a harder level to implement and keep consistency. For example, on CockroachDB the replacements were something as follows:
placeholders: $1
strings: '_'
anything else: _

When there was IN clause, it would be replaced by one of the values: __more1_10__, __more10_100__, _moreX__ (with X the between 100 and 1000, such as 100, 200, 300,...) or __more1000_plus__. That created a nice balance of separating groups that would use different plan executions, but at the same time keeping cardinality lower of different possible final strings, since the list can be quite big (I saw cases with 20k+ values in a list)

Another thing to consider on this option is level of sanitization, because is more common for the values to be hidden, but not the columns, so it would need options for sanitazing just the value or the value + columns.

@trask
Copy link
Member

trask commented May 8, 2024

thoughts from SIG meeting:

  • general statement for all dbs - replace all string and numeric literals with some placeholder (maybe ?)

@trask trask assigned trask and unassigned reyang May 8, 2024
@trask
Copy link
Member

trask commented May 17, 2024

from SIG meeting:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging a pull request may close this issue.

8 participants