fix(ingest/mongodb): Fix downsampling the collection schema output undetermined #9612

TonyOuyangGit · 2024-01-11T19:16:46Z

We noticed during recurring MongoDB ingestion that the fields of a dataset change between runs even though the source data has not been modified.

The root cause is that ingestion downsamples the collection schema to the max_schema_size set in the config. The collection fields are sorted by count, but no further ordering is applied when counts are equal, so the fields that survive the cutoff can differ from run to run. We should add delimited_name as a secondary key in the sorted call so the output is consistent.
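A minimal sketch of the proposed ordering, not the actual connector code: the `downsample_schema` helper and the sample dictionaries are hypothetical, and only the `count` and `delimited_name` keys and the `max_schema_size` setting come from the description above. It shows that a two-part sort key gives the same result regardless of the order in which fields were collected.

```python
from typing import Any, Dict, List


def downsample_schema(
    collection_schema: Dict[str, Dict[str, Any]], max_schema_size: int
) -> List[Dict[str, Any]]:
    """Keep the max_schema_size most frequent fields, breaking ties by name.

    Hypothetical helper for illustration only.
    """
    ranked = sorted(
        collection_schema.values(),
        # Negate count for descending order; delimited_name (ascending)
        # decides the order of fields whose counts are equal.
        key=lambda field: (-field["count"], field["delimited_name"]),
    )
    return ranked[:max_schema_size]


# The same fields, collected in two different orders.
schema_a = {
    "b": {"delimited_name": "b", "count": 5},
    "a": {"delimited_name": "a", "count": 5},
    "id": {"delimited_name": "id", "count": 9},
}
schema_b = dict(reversed(list(schema_a.items())))

kept_a = [f["delimited_name"] for f in downsample_schema(schema_a, 2)]
kept_b = [f["delimited_name"] for f in downsample_schema(schema_b, 2)]
```

With a key of `count` alone, Python's stable sort would keep tied fields in their input order, so `schema_a` and `schema_b` could retain different fields after the cutoff.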

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

…datahub-project#286)

This PR implements flattening of the `map` attribute type when scanning table items for DynamoDB ingestion. Most of the logic for expanding nested fields is adopted from `metadata-ingestion/src/datahub/ingestion/source/schema_inference/object.py`, which recursively calls `append_schema` for `map` data type fields and constructs the field path delimited by `FIELD_DELIMITER`.

According to the data types supported in DynamoDB in the AWS [docs](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes), the `List` and `Map` types both support recursive structure. Since expanding lists or lists of maps would add more complexity, for now we only expand the `Map` type and will handle expanding lists in the future.

This PR also adopts a [fix](datahub-project#9612) from MongoDB to sort by `count` and `delimited_name` when downsampling the table schema.

Updated `test_dynamodb.py` to add `List` and `Map` type items to the test table and check that `Map` type nested fields are ingested correctly.

---------

Co-authored-by: Tamas Nemeth <treff7es@gmail.com>
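The map flattening described in that commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the DataHub implementation: `flatten_map_fields` is a hypothetical helper, the `"."` value for `FIELD_DELIMITER` is assumed, and items are modeled as plain Python dictionaries rather than DynamoDB's typed attribute format. Only dictionaries are expanded; lists are recorded but not descended into, matching the stated scope.

```python
from typing import Any, Dict

FIELD_DELIMITER = "."  # assumed value, for illustration only


def flatten_map_fields(item: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Map each field path to its Python type name, recursing into dicts only.

    Hypothetical helper; lists are treated as leaf fields.
    """
    fields: Dict[str, str] = {}
    for name, value in item.items():
        path = f"{prefix}{FIELD_DELIMITER}{name}" if prefix else name
        fields[path] = type(value).__name__
        if isinstance(value, dict):
            # Nested map: expand its children under the parent path.
            fields.update(flatten_map_fields(value, path))
    return fields


paths = flatten_map_fields(
    {
        "id": 1,
        "address": {"city": "Oslo", "geo": {"lat": 59.9}},
        "tags": ["a", {"k": "v"}],
    }
)
```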

check in fix for mongodb

81c0024

github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) and community-contribution (PR or Issue raised by member(s) of DataHub Community) labels Jan 11, 2024

vercel bot deployed to Preview January 11, 2024 19:36 View deployment

hsheth2 approved these changes Jan 12, 2024

hsheth2 merged commit 33e3294 into datahub-project:master Jan 12, 2024
53 checks passed

TonyOuyangGit mentioned this pull request Feb 14, 2024

feat(ingest/dynamoDB): flatten struct fields #9852

Merged

5 tasks
