feat(ingest): Clear dataset ownership, mark status, add browser paths transformers #2986

aseembansal-gogo · 2021-07-29T09:35:45Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable)

shirshanka · 2021-08-08T18:27:01Z

metadata-ingestion/src/datahub/ingestion/transformer/clear_dataset_ownership.py

+        if not isinstance(mce.proposedSnapshot, DatasetSnapshotClass):
+            return mce
+
+        ownership = builder.get_or_add_aspect(


There could be cases where ingestion is not bringing in owners, but owners have also been added via UI. In that case, ownership metadata will be overwritten by the clearing transformer. It might be useful to set owners to [] only if ownership is already set to some value by ingestion.

Correct. That is the only use case. This is similar to add dataset ownership transformer which will also replace the dataset owners.

e.g. in case of glue metadata I found some tables had ownership of hadoop. So I want to mark all of them as removed. I use them alongwith add_dataset_ownership to get the correct owners ingested. I can add a warning in the docs and the use case.

Note that this is true for other transformers too. add_dataset_tags will actually set dataset tags in the UI. From a transformer only perspective the naming makes sense. As it is doing work on what is sent from the source. Added a warning for the same in the docs

FYI: In the case of tags, when the UI attaches tags, those are added to EditableSchemaMetadata, so they are kept separate from the system ingested tags.

I think it is okay to implement this as per the name (clearing ownership), but I'm not sure about whether this is practically what people would want to do during an ingestion.

e.g. if you want to clear ownership for a bunch of datasets, it might be simpler to just issue REST calls to clear ownership for those dataset urns.

Our story for sending "delta-s" for ownership is not great, so until we build that, we will encounter these issues. Hopefully we can fix up these transformers in the future to implement true delta capability. So, add and remove instead of set and clear.

So if I ingest from this source periodically it will add the extra owner every time (because hadoop is set as owner by glue). And then run REST endpoint to remove owners every time. This will keep on increasing the owner aspect version every time for no reason for all the datasets.

Clearing out owners during ingestion is a cleaner way instead of letting wrong data get in and cleaning it after ingestion.

metadata-ingestion/src/datahub/ingestion/transformer/add_dataset_browse_path.py

shirshanka · 2021-08-08T18:30:50Z

metadata-ingestion/transformers.md

+If you would like to stop a dataset from appearing in the UI then you need to mark the status of the dataset
+```yaml
+transformers:
+  - type: "mark_dataset_status"


It seems a bit extreme to mark all datasets that are ingested by this source as removed.
What is the specific use-case that we are trying to address here?

The new rollback functionality of datahub ingest command might be a better way to clear out a bad ingestion?

The specific use-case is when I am trying to remove all datasets. The transformer does not mean we mark all datasets as removed. The schema and table filters are present on all datasources. They can be used to filter out what we have to mark as removed. e.g. If I have a specific list of tables that I want to remove then I can filter for them in a receipe and can mark them as removed. I'll add this in docs to explain when this can be removed.

There have been some instances of people asking "How can we remove a dataset" on the slack channel. So those people can use this transformer after filtering for the specific datasets in the source and mark them as removed.

There is also this now: https://datahubproject.io/docs/how/delete-metadata/#delete-by-urn

Regarding delete by URN I am not sure how the CLI will know what URL GMS is running on as mentioned here https://datahubspace.slack.com/archives/CV2KB471C/p1627234588045700?thread_ts=1627195316.042500&cid=CV2KB471C

I agree having the CLI would be much cleaner for one-off cases. For larger batches having a recipe would be the cleaner approach.

metadata-ingestion/transformers.md

shirshanka

Sorry it took me a bit of time to get to this review.
Added a few suggestions on code / docs and a few clarification questions to help me understand the need for some of these transformers better!

shirshanka · 2021-08-09T06:04:30Z

metadata-ingestion/transformers.md

@@ -6,6 +6,12 @@ Oftentimes we want to modify metadata before it reaches the ingestion sink – f

 Moreover, a transformer allows one to have fine-grained control over the metadata that’s ingested without having to modify the ingestion framework's code yourself. Instead, you can write your own module that can take MCEs however you like. To configure the recipe, all that's needed is a module name as well as any arguments.

+## Note about transformers use
+
+The naming of the transformers is based on what they do to the data created by ingestion sources. It is not based on what finally happens in datahub. e.g. `simple_add_dataset_tags` adds a set of tags to the data created by ingestion sources. So if source did not add any tags you can use `simple_add_dataset_tags` to add any tags using this transformer. But this will overwrite any tags already present in the system. If you have done an ingestion once and then added some tags through the UI this will overwrite those tags. 


Unfortunately, this is not true about all things today.
e.g. Tags and Descriptions have EditableSchemaMetadata aspect that is written to from the UI, thus keep the "system ingestion" pathway separate from the UI pathway.

Other things like Ownership don't have the same split, so they can get overwritten by ingestion pathway.

This confusion can cause some nasty surprises

metadata-ingestion/transformers.md

shirshanka

LGTM!

…s, add browse paths (datahub-project#2986)

aseembansal-gogo added 3 commits July 29, 2021 15:01

feat(ingest): transformer to clear ownership

bf82ba1

feat(ingest): transformer to clear ownership

9779806

feat(ingest): refactor to remove copy paste

59e496c

aseembansal-gogo changed the title ~~feat(ingest): Clear dataset onwership~~ feat(ingest): Clear dataset ownership transformer Jul 29, 2021

aseembansal-gogo changed the title ~~feat(ingest): Clear dataset ownership transformer~~ feat(ingest): Clear dataset ownership & mark removed transformers Jul 29, 2021

feat(transform) transformer to mark dataset status

c6a8a3f

aseembansal-gogo changed the title ~~feat(ingest): Clear dataset ownership & mark removed transformers~~ feat(ingest): Clear dataset ownership & mark status transformers Jul 29, 2021

aseembansal-gogo added 7 commits July 29, 2021 17:10

feat(transform) fix tests

98dc822

feat(transform) fix tests

bb95c49

feat(transform) reformat test files

0958cd5

feat(transform) fix tests

21c34a4

feat(transform): fix tests

a8e38e1

feat(transform): fix docstrings

1eaa2d6

feat(ingest): add transformer to add browser paths to datasets

1d12b00

aseembansal-gogo changed the title ~~feat(ingest): Clear dataset ownership & mark status transformers~~ feat(ingest): Clear dataset ownership, mark status, add browser paths transformers Aug 6, 2021

feat(ingest): add docs for transformer, change name to be consistent

1c5df4d

shirshanka reviewed Aug 8, 2021

View reviewed changes

metadata-ingestion/src/datahub/ingestion/transformer/add_dataset_browse_path.py Show resolved Hide resolved

shirshanka reviewed Aug 8, 2021

View reviewed changes

metadata-ingestion/transformers.md Show resolved Hide resolved

shirshanka requested changes Aug 8, 2021

View reviewed changes

aseembansal-gogo added 4 commits August 9, 2021 09:33

feat(ingest): incorporate code review feedback

e2b9dbd

feat(ingest): incorporate code review feedback

8fea10c

feat(ingest): revert moving check to base class

24eb2e2

feat(ingest): revert moving check to base class

862c1f3

shirshanka reviewed Aug 9, 2021

View reviewed changes

aseembansal-gogo added 2 commits August 9, 2021 12:18

feat(ingest): rename transformer name to be consistent, fix docs

e9097dc

feat(ingest): rename transformer name to be consistent, fix docs

1519be0

shirshanka reviewed Aug 9, 2021

View reviewed changes

metadata-ingestion/transformers.md Outdated Show resolved Hide resolved

feat(ingest): fix typo

c546167

shirshanka approved these changes Aug 9, 2021

View reviewed changes

shirshanka merged commit 9ee091f into datahub-project:master Aug 9, 2021

aseembansal-gogo deleted the clear_dataset_onwership branch August 10, 2021 04:23

rahulbsw pushed a commit to rahulbsw/datahub that referenced this pull request Sep 2, 2021

feat(ingest): add transformers to clear dataset ownership, mark statu…

37b0ec1

…s, add browse paths (datahub-project#2986)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ingest): Clear dataset ownership, mark status, add browser paths transformers #2986

feat(ingest): Clear dataset ownership, mark status, add browser paths transformers #2986

aseembansal-gogo commented Jul 29, 2021

shirshanka Aug 8, 2021

aseembansal-gogo Aug 9, 2021 •

edited

Loading

aseembansal-gogo Aug 9, 2021

shirshanka Aug 9, 2021

aseembansal-gogo Aug 9, 2021

shirshanka Aug 8, 2021

aseembansal-gogo Aug 9, 2021

aseembansal-gogo Aug 9, 2021

shirshanka Aug 9, 2021

aseembansal-gogo Aug 9, 2021

shirshanka left a comment

shirshanka Aug 9, 2021

aseembansal-gogo Aug 9, 2021

shirshanka left a comment

feat(ingest): Clear dataset ownership, mark status, add browser paths transformers #2986

feat(ingest): Clear dataset ownership, mark status, add browser paths transformers #2986

Conversation

aseembansal-gogo commented Jul 29, 2021

Checklist

Choose a reason for hiding this comment

aseembansal-gogo Aug 9, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shirshanka left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shirshanka left a comment

Choose a reason for hiding this comment

aseembansal-gogo Aug 9, 2021 •

edited

Loading