fix: sanitize messageID from \u0000 and irregular utf8 runes #4063
+250
−2
Description
Issue
An incident in production was caused by \u0000 being passed as the "messageId" by the client. This caused the messageId to be stored as empty in the gateway db, even though we check for empty IDs and populate them if missing.
It happened because:
- We apply strings.TrimSpace and check if the result is empty
- \u0000 was not trimmed, so the messageId was not considered empty
- The payload was sent with \u0000 to the database
- \u0000 cannot be written into a JSONB column, and we get an error
- \u0000 gets removed and the payload was stored to db with an empty messageId
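The gap in the empty check can be reproduced with a minimal sketch. The helper name isEmptyAfterTrim is hypothetical and only mirrors the trim-then-compare check described above; it is not the gateway's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// isEmptyAfterTrim is a hypothetical stand-in for the original check:
// trim whitespace, then compare against the empty string.
func isEmptyAfterTrim(id string) bool {
	return strings.TrimSpace(id) == ""
}

func main() {
	// Whitespace-only input is trimmed away and detected as empty.
	fmt.Println(isEmptyAfterTrim("  \t")) // true

	// U+0000 is not Unicode whitespace, so TrimSpace leaves it in place
	// and the ID is treated as non-empty.
	fmt.Println(isEmptyAfterTrim("\u0000")) // false
}
```

Because the check passes, no replacement ID is generated, and the NUL character travels with the payload until it is dropped later on.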
Solution
Proper sanitization of the messageId before checking whether it is empty.
The simple solution was to remove just the \u0000. However, I wanted to handle other issues associated with UTF-8, so I looked into similar issues and how we can better sanitize our input.
I started with unicode.IsPrint, but testing against a list of invisible characters showed there were some gaps, so the list from https://invisible-characters.com was introduced.
Invisible characters, with or without malicious intent, can cause confusion when debugging. Also, they are not handled properly by common string manipulation functions. Given the importance of the messageId, it is better to clean it properly even if it requires some extra complexity and CPU cycles. For the same reasons, I've moved the sanitization function under utils.
An alternative solution is to accept a much stricter format, a UUID for example, or, as an optimisation, to test whether the value is a valid UUID before trying to sanitise it. However, sanitization is not a costly operation, and almost all of the time no mutation is required (thus no memory allocation).
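The approach described above could look roughly like the following sketch. It is not the function added in this PR: the names sanitizeID, needsRemoval and extraInvisible are hypothetical, and the extraInvisible set is a small illustrative subset rather than the full list from https://invisible-characters.com.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// extraInvisible holds a few runes that unicode.IsPrint reports as printable
// but that render as nothing. Illustrative subset only (assumption).
var extraInvisible = map[rune]struct{}{
	'\u115F': {}, // Hangul choseong filler
	'\u1160': {}, // Hangul jungseong filler
	'\u2800': {}, // Braille pattern blank
	'\u3164': {}, // Hangul filler
	'\uFFA0': {}, // halfwidth Hangul filler
}

// needsRemoval reports whether a rune should be dropped from an ID.
// Invalid UTF-8 bytes decode to utf8.RuneError, so they are dropped too
// (as is a literal U+FFFD, which this sketch does not distinguish).
func needsRemoval(r rune) bool {
	if r == utf8.RuneError || !unicode.IsPrint(r) {
		return true
	}
	_, found := extraInvisible[r]
	return found
}

// sanitizeID returns id without the runes flagged by needsRemoval.
// When nothing has to be removed, the input is returned as-is.
func sanitizeID(id string) string {
	if strings.IndexFunc(id, needsRemoval) == -1 {
		return id // common case: no mutation needed
	}
	return strings.Map(func(r rune) rune {
		if needsRemoval(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, id)
}

func main() {
	for _, id := range []string{"abc-123", "\u0000", "a\u0000b", "a\xffb", "x\u3164y"} {
		clean := strings.TrimSpace(sanitizeID(id))
		fmt.Printf("%q -> %q (empty: %t)\n", id, clean, clean == "")
	}
}
```

With sanitization running before the empty check, an ID consisting only of \u0000 becomes empty and can be replaced with a generated one, as already happens for missing IDs.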
benchmark
Linear Ticket
https://linear.app/rudderstack/issue/PIPE-468/sanitize-message-id
Security