fix(ingest/mongodb): Fix downsampling the collection schema output undetermined #9612

Merged

Conversation

TonyOuyangGit
Contributor

We noticed during recurring MongoDB ingestion that the fields in a dataset change between runs even though the source data has not been modified.

The root cause is that ingestion downsamples the collection schema based on the max_schema_size set in the config. The collection fields are sorted by count, but no further ordering is applied when counts are equal, so which of the tied fields survive the cut can vary between runs. Adding delimited_name as a secondary sort key makes the output consistent.
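The tie-breaking idea can be sketched as follows. This is a minimal illustration, not the actual DataHub source: the function name `downsample_schema` and the shape of the input dictionary are assumptions, and only the `count` and `delimited_name` keys come from the description above.

```python
from typing import Any, Dict, List


def downsample_schema(
    collection_schema: Dict[str, Dict[str, Any]],
    max_schema_size: int,
) -> List[Dict[str, Any]]:
    """Keep the max_schema_size most frequent fields, deterministically.

    Hypothetical helper: each value is assumed to be a dict holding at
    least a "count" and a "delimited_name" entry.
    """
    fields = sorted(
        collection_schema.values(),
        # Primary key: count. Secondary key: delimited_name, so that
        # fields with equal counts always come out in the same order.
        key=lambda field: (field["count"], field["delimited_name"]),
        reverse=True,
    )
    return fields[:max_schema_size]


# Two schemas with the same fields but different insertion order.
schema_a = {
    "a": {"count": 5, "delimited_name": "a"},
    "b": {"count": 5, "delimited_name": "b"},
    "c": {"count": 9, "delimited_name": "c"},
}
schema_b = dict(reversed(list(schema_a.items())))

names_a = [f["delimited_name"] for f in downsample_schema(schema_a, 2)]
names_b = [f["delimited_name"] for f in downsample_schema(schema_b, 2)]
```

With the secondary key in place, both inputs keep the same two fields regardless of the order in which the fields were discovered. Sorting on `count` alone would leave the relative order of `a` and `b` dependent on insertion order, because Python's sort is stable.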

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion and community-contribution labels Jan 11, 2024
@hsheth2 hsheth2 merged commit 33e3294 into datahub-project:master Jan 12, 2024
53 checks passed
TonyOuyangGit added a commit to TonyOuyangGit/datahub that referenced this pull request Feb 14, 2024
…datahub-project#286)

This PR implements flattening of the `map` attribute type when scanning
table items for DynamoDB ingestion. Most of the logic for expanding
nested fields is adapted from
`metadata-ingestion/src/datahub/ingestion/source/schema_inference/object.py`,
which recursively calls `append_schema` for `map`-typed fields and
constructs the field path delimited by `FIELD_DELIMITER`.

According to the data types DynamoDB supports (see the AWS
[docs](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes)),
both `List` and `Map` types can be nested recursively. Expanding lists,
or lists of maps, would add more complexity, so for now only the `Map`
type is expanded; list expansion is left for future work.
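The map-only flattening described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `flatten_maps` is a hypothetical helper, `FIELD_DELIMITER` is assumed to be `"."`, and items are plain Python values rather than DynamoDB's typed attribute encoding.

```python
from typing import Any, Dict, Tuple

# Assumed value; the real constant lives in the schema inference module.
FIELD_DELIMITER = "."


def flatten_maps(
    item: Dict[str, Any],
    prefix: Tuple[str, ...] = (),
) -> Dict[str, str]:
    """Map each delimited field path to the Python type name of its value.

    Only dict (map) values are expanded recursively. Lists are recorded
    as a single field and their elements are not expanded.
    """
    flattened: Dict[str, str] = {}
    for key, value in item.items():
        path = prefix + (key,)
        flattened[FIELD_DELIMITER.join(path)] = type(value).__name__
        if isinstance(value, dict):
            # Recurse into nested maps, extending the field path.
            flattened.update(flatten_maps(value, path))
    return flattened


example_item = {
    "id": 1,
    "address": {"city": "x", "geo": {"lat": 1.0}},
    "tags": ["a", {"k": 1}],
}
flattened_fields = flatten_maps(example_item)
```

Here `address` expands into `address.city`, `address.geo`, and `address.geo.lat`, while `tags` stays a single `list` field even though it contains a map.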

This PR also adopts a
[fix](datahub-project#9612) from MongoDB
that sorts by `count` and `delimited_name` when downsampling the table
schema.

Updated `test_dynamodb.py` to add `List` and `Map` type items to the test
table and to check that `Map` type nested fields are ingested correctly.

---------

Co-authored-by: Tamas Nemeth <treff7es@gmail.com>