[hotfix] Mongo CDC fails to capture collections with `.` in names #2488
PTAL @Jiabao-Sun @leonardBang
Thanks @yuxiqian for this great work.

The root cause of this error is that the collection configuration option contains the '.' character, which is a metacharacter in regular expressions. As a result, the collection configuration option is mistakenly inferred to be a regular expression.

We distinguish between regex and non-regex database and collection configuration options because the filtering of databases and collections has been pushed down to the ChangeStream level. This saves the additional query overhead of fullDocumentLookup. When explicit databases and collections are specified, the filtering overhead on the ChangeStream is minimal. If we instead treated all database and collection configuration options as regular expressions, it would increase the computational cost and response time of MongoDB.

Unfortunately, we cannot accurately determine whether a string is meant as a regular expression, because every string can be interpreted as one. This is what leads to issues like this one, where exact matching cannot be achieved. In addition, a lot of complex conditional logic has been added to the code to handle both cases.

I believe it would be beneficial to explicitly distinguish between configuration options that are regular expressions and those that are explicit database and collection names. For example, we could introduce separate options for the two cases.

cc @leonardBang
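To make the ambiguity described above concrete, here is a minimal sketch in plain Java using only `java.util.regex`. The class name, method names, and the namespace `inventory.orders.archive` are hypothetical and are not taken from the connector code. It only illustrates that a dotted name is also a valid regex that matches more than itself, while quoting it gives exact matching.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the connector's actual matching logic.
final class DotAmbiguitySketch {

    // Interprets the configured value as a regular expression.
    static boolean matchesAsRegex(String configured, String namespace) {
        return Pattern.matches(configured, namespace);
    }

    // Interprets the configured value as an explicit name.
    // Pattern.quote escapes every metacharacter, including '.'.
    static boolean matchesAsLiteral(String configured, String namespace) {
        return Pattern.matches(Pattern.quote(configured), namespace);
    }

    public static void main(String[] args) {
        String configured = "inventory.orders.archive"; // assumed example value

        // As a regex, '.' matches any character, so an unrelated name also matches.
        System.out.println(matchesAsRegex(configured, "inventory.orders.archive"));
        System.out.println(matchesAsRegex(configured, "inventoryXordersYarchive"));

        // As a quoted literal, only the exact name matches.
        System.out.println(matchesAsLiteral(configured, "inventory.orders.archive"));
        System.out.println(matchesAsLiteral(configured, "inventoryXordersYarchive"));
    }
}
```

Since the same string is valid under both interpretations, nothing in the value itself says which one the user meant, which is the argument for separate explicit options.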
Actually, regex matching here works as expected. +1 for explicit regex options, though that is unrelated to this bug report.
LGTM
…names (apache#2488) (cherry picked from commit 44cd46e)
MongoDB currently allows collection names containing dots (`.`). According to the Mongo CDC docs, it should be possible to match such collections with a fully-qualified regex. However, this does not work when the incremental snapshot option is enabled.
It might be caused by a glitch in the Debezium framework, which simply treats `.` as a separator between catalog, schema, and table (instead of as a valid character in a table name).

Currently, Mongo CDC never requests the schema field (`useCatalogBeforeSchema` is always true), so a weaker version of the `splitTableId` function was implemented, which should correctly handle dots appearing in collection names.
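The idea behind such a weaker split can be sketched as follows. This is not the code from this PR: the class and method names are hypothetical. It assumes the identifier has the form `database.collection` and relies on the fact that MongoDB database names cannot contain dots, so only the first dot is treated as a separator.

```java
import java.util.Arrays;

// Hypothetical sketch; the PR's actual splitTableId implementation may differ.
final class SplitTableIdSketch {

    // Naive approach: every '.' is a separator, so dotted collection names break.
    static String[] splitOnEveryDot(String namespace) {
        return namespace.split("\\.");
    }

    // Split at the first '.' only: everything before it is the database,
    // everything after it (dots included) is the collection.
    static String[] splitAtFirstDot(String namespace) {
        int idx = namespace.indexOf('.');
        if (idx < 0) {
            // No separator: treat the whole string as a single part.
            return new String[] {namespace};
        }
        return new String[] {namespace.substring(0, idx), namespace.substring(idx + 1)};
    }

    public static void main(String[] args) {
        String namespace = "inventory.orders.archive"; // assumed example value
        System.out.println(Arrays.toString(splitOnEveryDot(namespace)));
        System.out.println(Arrays.toString(splitAtFirstDot(namespace)));
    }
}
```

With the naive split, `inventory.orders.archive` yields three parts and would be read as catalog, schema, and table. Splitting at the first dot keeps `orders.archive` intact as the collection name.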