fix(ingest/json-schema): convert non-string enums to strings #8479

Merged


benjamin-awd
Contributor

@benjamin-awd benjamin-awd commented Jul 21, 2023

Prior to this change, schema enums were processed without being converted to their string representations. This caused an ingestion failure for schemas with non-string enum elements.

To address this, the code has been refactored so that every element in the schema's enum list is converted to its string representation using the json.dumps() function.

This change adds quotation marks around string elements, which clearly distinguishes strings from other values such as integers or None (similar to the json-schema-for-humans library).
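As a rough illustration of the problem and the fix described above, here is a minimal sketch. It assumes the failure resembles a str.join() over a mixed-type enum list; the `enum_values` list is made up for the example and is not taken from the PR's code or tests.

```python
import json

# Hypothetical enum list from a JSON schema (illustrative only).
enum_values = ["baz", 1, None, True]

# Joining the raw values fails, because str.join() accepts only strings.
try:
    ",".join(enum_values)
except TypeError as exc:
    print(f"join failed: {exc}")

# Serializing each element with json.dumps() first turns every value
# into a string, and wraps actual strings in double quotes.
as_strings = [json.dumps(value) for value in enum_values]
print(as_strings)  # ['"baz"', '1', 'null', 'true']
```

Note that json.dumps() renders Python's None as null and True as true, so the output follows JSON notation rather than Python's.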

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Related issue: #8480

@benjamin-awd benjamin-awd force-pushed the fix-non-string-jsonschema-enum branch from 6b5c5a7 to 72122fc Compare July 21, 2023 14:52
@benjamin-awd benjamin-awd changed the title fix(ingest): convert non-string enums to strings fix(ingest/json-schema): convert non-string enums to strings Jul 21, 2023
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 21, 2023
@shirshanka
Contributor

Thanks for the contribution!
Could you add a test case for this here?
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/tests/unit/schema/test_json_schema_util.py

@benjamin-awd benjamin-awd force-pushed the fix-non-string-jsonschema-enum branch 2 times, most recently from e2b3623 to 95c6995 Compare July 23, 2023 07:13
@benjamin-awd
Contributor Author

> Could you add a test case for this here?

Hey @shirshanka, I've added a test -- let me know if there's anything else I'm missing.


Prior to this change, the schema enums were processed without being
converted to their string representations. This caused an ingestion
failure for schemas with non-string enum elements.

To address this, the code has been refactored to ensure that all
elements in the schema's enum list are now converted
to their string representations using the `repr()` function.

This change adds quotation marks around string elements, in order to
clearly delineate strings from other special values.

e.g. "Foo", "Bar", 0, null
Capitalizing the prefix and adding a colon creates a clean visual
break: it gives the reader a clear starting point by separating
the actual elements of the enum from the general text.

e.g.
one of foo,bar,baz
->
One of: 'foo', 'bar', 'baz'
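
The formatting change above could be sketched as follows. This assumes repr() is used for the conversion, as this commit message describes; `describe_enum` is a hypothetical helper name, not the function in DataHub's code.

```python
from typing import Any, List


def describe_enum(values: List[Any]) -> str:
    """Hypothetical helper: build the 'One of: ...' description for an enum."""
    # repr() quotes strings with single quotes and leaves other values bare.
    return "One of: " + ", ".join(repr(value) for value in values)


print(describe_enum(["foo", "bar", "baz"]))  # One of: 'foo', 'bar', 'baz'
```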
@benjamin-awd benjamin-awd force-pushed the fix-non-string-jsonschema-enum branch from 95c6995 to 65889df Compare July 23, 2023 07:19
The current implementation using the repr function does not align with
JSON conventions. Notably, the repr function encloses strings in
single quotes ('baz') instead of using the standard double
quotes ("baz"), and it represents null values as
Python's None, rather than using JSON's explicit null keyword.

To ensure consistency and better adherence to JSON conventions,
we should switch to json.dumps() for the conversion.
json.dumps() ensures that strings are enclosed in double quotes and
null values are represented correctly as null.
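
A small side-by-side sketch of the difference described above; the `values` list is made up for illustration.

```python
import json

# Illustrative enum values: a string, an integer, and a null.
values = ["baz", 0, None]

repr_form = ", ".join(repr(v) for v in values)
json_form = ", ".join(json.dumps(v) for v in values)

print(repr_form)  # 'baz', 0, None
print(json_form)  # "baz", 0, null
```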
@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Jul 24, 2023
@anshbansal anshbansal merged commit 2e2a674 into datahub-project:master Aug 1, 2023
43 checks passed
spadhi7 added a commit to spadhi7/datahub that referenced this pull request Aug 29, 2023
* tag 'v0.10.5': (222 commits)
  fix(test): increase siblings.js test stability (datahub-project#8542)
  feat(search): Allow aggregating on facets that are not explicitly part of default filter set (datahub-project#8540)
  fix(ui) Make multiple small updates to new search and browse (datahub-project#8524)
  feat(presto-on-hive): allow v1 fieldpaths in the presto-on-hive source (datahub-project#8474)
  feat(cli): Adds ability to upload recipes to DataHub's UI (datahub-project#8317)
  feat(browseV2): add browseV2 logic to system update (datahub-project#8506)
  fix(ingest/json-schema): convert non-string enums to strings (datahub-project#8479)
  feat(ingestion/tableau): support column level lineage for custom sql (datahub-project#8466)
  test(ingest): test case statements with sql parser (datahub-project#8437)
  feat(ingest/vertica): performance improvement and bug fixes (datahub-project#8328)
  ci: reduce git fetch depth (datahub-project#8473)
  fix(ingest): remove duplication of tags (datahub-project#8532)
  docs: small update to homepage (datahub-project#8483)
  fix(ingest): pin boto3-stubs in CI (datahub-project#8527)
  feat(siblings): hiding non-existant siblings in FE (datahub-project#8528)
  fix(ingest/build): Fix sagemaker mypy and flake8 issues (datahub-project#8530)
  feat(metrics): add metrics for aspect write and bytes (datahub-project#8526)
  feat(elasticsearch): allow bulk delete (datahub-project#8424)
  fix(ui): use locale lowercase when filtering columns of an entity in the lineage (datahub-project#8213)
  fix(auth): ignore case when comparing http headers (datahub-project#8356)
  ...
Labels
ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
4 participants