-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest): Clear dataset ownership, mark status, add browser paths transformers #2986
feat(ingest): Clear dataset ownership, mark status, add browser paths transformers #2986
Conversation
if not isinstance(mce.proposedSnapshot, DatasetSnapshotClass): | ||
return mce | ||
|
||
ownership = builder.get_or_add_aspect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There could be cases where ingestion is not bringing in owners, but owners have also been added via UI. In that case, ownership metadata will be overwritten by the clearing transformer. It might be useful to set owners to [] only if ownership is already set to some value by ingestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. That is the only use case. This is similar to add dataset ownership transformer which will also replace the dataset owners.
e.g. in case of glue metadata I found some tables had ownership of hadoop
. So I want to mark all of them as removed. I use them alongwith add_dataset_ownership
to get the correct owners ingested. I can add a warning in the docs and the use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is true for other transformers too. add_dataset_tags
will actually set dataset tags in the UI. From a transformer only perspective the naming makes sense. As it is doing work on what is sent from the source. Added a warning for the same in the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: In the case of tags, when the UI attaches tags, those are added to EditableSchemaMetadata, so they are kept separate from the system ingested tags.
I think it is okay to implement this as per the name (clearing ownership), but I'm not sure about whether this is practically what people would want to do during an ingestion.
e.g. if you want to clear ownership for a bunch of datasets, it might be simpler to just issue REST calls to clear ownership for those dataset urns.
Our story for sending "delta-s" for ownership is not great, so until we build that, we will encounter these issues. Hopefully we can fix up these transformers in the future to implement true delta capability. So, add
and remove
instead of set
and clear
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I ingest from this source periodically it will add the extra owner every time (because hadoop
is set as owner by glue). And then run REST endpoint to remove owners every time. This will keep on increasing the owner aspect version every time for no reason for all the datasets.
Clearing out owners during ingestion is a cleaner way instead of letting wrong data get in and cleaning it after ingestion.
metadata-ingestion/src/datahub/ingestion/transformer/add_dataset_browse_path.py
Show resolved
Hide resolved
If you would like to stop a dataset from appearing in the UI then you need to mark the status of the dataset | ||
```yaml | ||
transformers: | ||
- type: "mark_dataset_status" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit extreme to mark all datasets that are ingested by this source as removed.
What is the specific use-case that we are trying to address here?
The new rollback
functionality of datahub ingest
command might be a better way to clear out a bad ingestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specific use-case is when I am trying to remove all datasets. The transformer does not mean we mark all datasets as removed. The schema and table filters are present on all datasources. They can be used to filter out what we have to mark as removed. e.g. If I have a specific list of tables that I want to remove then I can filter for them in a receipe and can mark them as removed. I'll add this in docs to explain when this can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There have been some instances of people asking "How can we remove a dataset" on the slack channel. So those people can use this transformer after filtering for the specific datasets in the source and mark them as removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also this now: https://datahubproject.io/docs/how/delete-metadata/#delete-by-urn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding delete by URN I am not sure how the CLI will know what URL GMS is running on as mentioned here https://datahubspace.slack.com/archives/CV2KB471C/p1627234588045700?thread_ts=1627195316.042500&cid=CV2KB471C
I agree having the CLI would be much cleaner for one-off cases. For larger batches having a recipe would be the cleaner approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took me a bit of time to get to this review.
Added a few suggestions on code / docs and a few clarification questions to help me understand the need for some of these transformers better!
metadata-ingestion/transformers.md
Outdated
@@ -6,6 +6,12 @@ Oftentimes we want to modify metadata before it reaches the ingestion sink – f | |||
|
|||
Moreover, a transformer allows one to have fine-grained control over the metadata that’s ingested without having to modify the ingestion framework's code yourself. Instead, you can write your own module that can take MCEs however you like. To configure the recipe, all that's needed is a module name as well as any arguments. | |||
|
|||
## Note about transformers use | |||
|
|||
The naming of the transformers is based on what they do to the data created by ingestion sources. It is not based on what finally happens in datahub. e.g. `simple_add_dataset_tags` adds a set of tags to the data created by ingestion sources. So if source did not add any tags you can use `simple_add_dataset_tags` to add any tags using this transformer. But this will overwrite any tags already present in the system. If you have done an ingestion once and then added some tags through the UI this will overwrite those tags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this is not true about all things today.
e.g. Tags and Descriptions have EditableSchemaMetadata aspect that is written to from the UI, thus keep the "system ingestion" pathway separate from the UI pathway.
Other things like Ownership don't have the same split, so they can get overwritten by ingestion pathway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This confusion can cause some nasty surprises
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…s, add browse paths (datahub-project#2986)
Checklist