
[hotfix] Mongo CDC fails to capture collections with . in names #2488

Merged


yuxiqian

@yuxiqian yuxiqian commented Sep 13, 2023

MongoDB allows collection names that contain dots (.). According to the Mongo CDC docs, such collections can be matched with a fully qualified regex like this:

db[.]coll[.]name[.]with[.]dots // matches collection "coll.name.with.dots" in "db" database
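
As a quick sanity check of the pattern above, the following sketch uses plain `java.util.regex` to show that `[.]` matches only a literal dot. The class `DottedNamespaceMatch` and its `matches` helper are illustrative names, not part of Mongo CDC, and the sketch assumes the pattern is matched against the full `database.collection` namespace.

```java
import java.util.regex.Pattern;

// Illustrative only: not part of Mongo CDC.
final class DottedNamespaceMatch {
    // Assumption: the pattern must match the whole "database.collection" string.
    static boolean matches(String regex, String namespace) {
        return Pattern.compile(regex).matcher(namespace).matches();
    }

    public static void main(String[] args) {
        String regex = "db[.]coll[.]name[.]with[.]dots";
        // "[.]" is a character class containing only a literal dot.
        System.out.println(matches(regex, "db.coll.name.with.dots")); // true
        System.out.println(matches(regex, "db.collXname.with.dots")); // false
    }
}
```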

However, this doesn't work when the incremental snapshot option is enabled. Here's a minimal reproduction:

@Test
public void testMatchCollectionWithDots() throws Exception {
    // 1. Given collections:
    // db: [coll.name]
    String db = CONTAINER.executeCommandFileInSeparateDatabase("ns-dotted");

    TableResult result = submitTestCase(db, db + "[.]coll[.]name");

    // 2. Wait for change stream records to arrive
    waitForSinkSize("mongodb_sink", 3);

    // 3. Check results
    String[] expected =
            new String[] {
                String.format("+I[%s, coll.name, A101]", db),
                String.format("+I[%s, coll.name, A102]", db),
                String.format("+I[%s, coll.name, A103]", db)
            };

    List<String> actual = TestValuesTableFactory.getResults("mongodb_sink");
    assertThat(actual, containsInAnyOrder(expected));

    result.getJobClient().get().cancel().get();
}

This might be caused by a glitch in the Debezium framework, which simply treats . as a separator between catalog, schema, and table rather than as a valid character in a table name.

Currently, Mongo CDC never requests the schema field (useCatalogBeforeSchema is always true), so this PR implements a weaker version of the splitTableId function, which should correctly handle dots in collection names.
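
The idea behind such a weaker split can be sketched as follows. This is not the code merged in this PR: `NamespaceSplit`, `naiveSplit`, and `splitOnFirstDot` are hypothetical names. The sketch assumes that a MongoDB database name cannot contain a dot, so everything after the first dot belongs to the collection name.

```java
import java.util.Arrays;

// Hypothetical sketch: not the splitTableId implementation from this PR.
final class NamespaceSplit {
    // Treats every dot as a separator, which breaks dotted collection names.
    static String[] naiveSplit(String id) {
        return id.split("\\.");
    }

    // Splits on the first dot only: [database, collection].
    // Assumption: database names never contain a dot.
    static String[] splitOnFirstDot(String id) {
        int idx = id.indexOf('.');
        if (idx < 0) {
            return new String[] {id};
        }
        return new String[] {id.substring(0, idx), id.substring(idx + 1)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(naiveSplit("db.coll.name")));      // [db, coll, name]
        System.out.println(Arrays.toString(splitOnFirstDot("db.coll.name"))); // [db, coll.name]
    }
}
```

With the naive split, `coll` would be read as a schema and `name` as the table, which is the kind of misparse described above.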

@yuxiqian

PTAL @Jiabao-Sun @leonardBang

@Jiabao-Sun

Thanks @yuxiqian for this great work.

The root cause of this error is that the collection configuration option contains the '.' character, which is a reserved character in regular expressions. As a result, the collection configuration option is mistakenly inferred as a regular expression.
We previously encountered a similar issue where the '-' character could not be matched, and we resolved it by adding some special-case conditional logic.

We distinguish between regex and non-regex database and collection configuration options because we have pushed the filtering of databases and collections down to the ChangeStream level. This saves the additional query overhead of fullDocumentLookup. When explicit databases and collections are specified, the filtering overhead on the ChangeStream is minimal. However, treating all database and collection configuration options as regular expressions would increase MongoDB's computational cost and response time.

Unfortunately, we cannot accurately determine whether a string is meant as a regular expression or as a literal name, because any string can be interpreted as a regular expression. This ambiguity has also led to issues like these, where exact matching cannot be achieved.
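
A small illustration of this ambiguity, using only `java.util.regex`: the string `coll.name` is both a valid collection name and a valid pattern, and when read as a pattern it matches more than the literal name. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative only: shows why a literal name and a regex cannot be told apart.
final class RegexAmbiguity {
    // Interprets the option as a regular expression.
    static boolean asRegex(String option, String collection) {
        return Pattern.matches(option, collection);
    }

    // Interprets the option as a literal name by quoting it first.
    static boolean asLiteral(String option, String collection) {
        return Pattern.matches(Pattern.quote(option), collection);
    }

    public static void main(String[] args) {
        System.out.println(asRegex("coll.name", "coll.name"));   // true
        System.out.println(asRegex("coll.name", "collXname"));   // true: '.' matches any character
        System.out.println(asLiteral("coll.name", "collXname")); // false
    }
}
```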

Additionally, deciding whether a configuration option is a regular expression or not has required a lot of complex conditional logic in the code. I believe it would be beneficial to explicitly distinguish between configuration options that are regular expressions and those that are explicit database and collection names. For example, we could introduce separate options such as databaseRegex and collectionRegex.
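
One way such a split could look is sketched below. `CollectionFilter`, `ofNames`, and `ofRegex` are hypothetical and do not exist in Mongo CDC; the sketch only shows that once the user states the intent, no guessing is needed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch of separate "explicit name" and "regex" options.
final class CollectionFilter {
    private CollectionFilter() {}

    // Explicit names are compared literally, so '.' and '-' need no special handling.
    static Predicate<String> ofNames(String... names) {
        Set<String> exact = new HashSet<>(Arrays.asList(names));
        return exact::contains;
    }

    // Regex options are always compiled; the whole name must match.
    static Predicate<String> ofRegex(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return name -> pattern.matcher(name).matches();
    }
}
```

For example, `CollectionFilter.ofNames("coll.name")` would accept only the collection `coll.name`, while `CollectionFilter.ofRegex("coll[.].*")` would accept every collection whose name starts with `coll.`.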

cc @leonardBang

@yuxiqian

> The root cause of this error is that the collection configuration option contains the '.' character, which is a reserved character in regular expressions

Actually, regex matching works as expected here (db[.]coll[.]name correctly matches coll.name in db). The problem is that when the parallel (incremental) snapshot is enabled, Debezium can't correctly parse a TableId from names like db.coll.name because of the unexpected . in the table name.

+1 for explicit regex options, though that is unrelated to this bug report.


@Jiabao-Sun Jiabao-Sun left a comment


LGTM

@Jiabao-Sun Jiabao-Sun merged commit 44cd46e into apache:master Sep 13, 2023
@yuxiqian yuxiqian deleted the fix/mongo-cdc-collection-qualification-issue branch September 13, 2023 06:20
e-mhui pushed a commit to e-mhui/flink-cdc-connectors that referenced this pull request Oct 18, 2023
GOODBOY008 pushed a commit to GOODBOY008/flink-cdc that referenced this pull request Oct 30, 2023
GOODBOY008 pushed a commit to GOODBOY008/flink-cdc that referenced this pull request Oct 30, 2023
ChaomingZhangCN pushed a commit to ChaomingZhangCN/flink-cdc that referenced this pull request Jan 13, 2025