Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest): Clear dataset ownership, mark status, add browser paths transformers #2986

Merged

Conversation

aseembansal-gogo
Copy link
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@aseembansal-gogo aseembansal-gogo changed the title feat(ingest): Clear dataset onwership feat(ingest): Clear dataset ownership transformer Jul 29, 2021
@aseembansal-gogo aseembansal-gogo changed the title feat(ingest): Clear dataset ownership transformer feat(ingest): Clear dataset ownership & mark removed transformers Jul 29, 2021
@aseembansal-gogo aseembansal-gogo changed the title feat(ingest): Clear dataset ownership & mark removed transformers feat(ingest): Clear dataset ownership & mark status transformers Jul 29, 2021
@aseembansal-gogo aseembansal-gogo changed the title feat(ingest): Clear dataset ownership & mark status transformers feat(ingest): Clear dataset ownership, mark status, add browser paths transformers Aug 6, 2021
if not isinstance(mce.proposedSnapshot, DatasetSnapshotClass):
return mce

ownership = builder.get_or_add_aspect(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be cases where ingestion is not bringing in owners, but owners have also been added via UI. In that case, ownership metadata will be overwritten by the clearing transformer. It might be useful to set owners to [] only if ownership is already set to some value by ingestion.

Copy link
Contributor Author

@aseembansal-gogo aseembansal-gogo Aug 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. That is the only use case. This is similar to add dataset ownership transformer which will also replace the dataset owners.

e.g. in case of glue metadata I found some tables had ownership of hadoop. So I want to mark all of them as removed. I use them alongwith add_dataset_ownership to get the correct owners ingested. I can add a warning in the docs and the use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is true for other transformers too. add_dataset_tags will actually set dataset tags in the UI. From a transformer only perspective the naming makes sense. As it is doing work on what is sent from the source. Added a warning for the same in the docs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: In the case of tags, when the UI attaches tags, those are added to EditableSchemaMetadata, so they are kept separate from the system ingested tags.

I think it is okay to implement this as per the name (clearing ownership), but I'm not sure about whether this is practically what people would want to do during an ingestion.

e.g. if you want to clear ownership for a bunch of datasets, it might be simpler to just issue REST calls to clear ownership for those dataset urns.

Our story for sending "delta-s" for ownership is not great, so until we build that, we will encounter these issues. Hopefully we can fix up these transformers in the future to implement true delta capability. So, add and remove instead of set and clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I ingest from this source periodically it will add the extra owner every time (because hadoop is set as owner by glue). And then run REST endpoint to remove owners every time. This will keep on increasing the owner aspect version every time for no reason for all the datasets.

Clearing out owners during ingestion is a cleaner way instead of letting wrong data get in and cleaning it after ingestion.

If you would like to stop a dataset from appearing in the UI then you need to mark the status of the dataset
```yaml
transformers:
- type: "mark_dataset_status"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit extreme to mark all datasets that are ingested by this source as removed.
What is the specific use-case that we are trying to address here?

The new rollback functionality of datahub ingest command might be a better way to clear out a bad ingestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specific use-case is when I am trying to remove all datasets. The transformer does not mean we mark all datasets as removed. The schema and table filters are present on all datasources. They can be used to filter out what we have to mark as removed. e.g. If I have a specific list of tables that I want to remove then I can filter for them in a receipe and can mark them as removed. I'll add this in docs to explain when this can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There have been some instances of people asking "How can we remove a dataset" on the slack channel. So those people can use this transformer after filtering for the specific datasets in the source and mark them as removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding delete by URN I am not sure how the CLI will know what URL GMS is running on as mentioned here https://datahubspace.slack.com/archives/CV2KB471C/p1627234588045700?thread_ts=1627195316.042500&cid=CV2KB471C

I agree having the CLI would be much cleaner for one-off cases. For larger batches having a recipe would be the cleaner approach.

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took me a bit of time to get to this review.
Added a few suggestions on code / docs and a few clarification questions to help me understand the need for some of these transformers better!

@@ -6,6 +6,12 @@ Oftentimes we want to modify metadata before it reaches the ingestion sink – f

Moreover, a transformer allows one to have fine-grained control over the metadata that’s ingested without having to modify the ingestion framework's code yourself. Instead, you can write your own module that can take MCEs however you like. To configure the recipe, all that's needed is a module name as well as any arguments.

## Note about transformers use

The naming of the transformers is based on what they do to the data created by ingestion sources. It is not based on what finally happens in datahub. e.g. `simple_add_dataset_tags` adds a set of tags to the data created by ingestion sources. So if source did not add any tags you can use `simple_add_dataset_tags` to add any tags using this transformer. But this will overwrite any tags already present in the system. If you have done an ingestion once and then added some tags through the UI this will overwrite those tags.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this is not true about all things today.
e.g. Tags and Descriptions have EditableSchemaMetadata aspect that is written to from the UI, thus keep the "system ingestion" pathway separate from the UI pathway.

Other things like Ownership don't have the same split, so they can get overwritten by ingestion pathway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confusion can cause some nasty surprises

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shirshanka shirshanka merged commit 9ee091f into datahub-project:master Aug 9, 2021
@aseembansal-gogo aseembansal-gogo deleted the clear_dataset_onwership branch August 10, 2021 04:23
rahulbsw pushed a commit to rahulbsw/datahub that referenced this pull request Sep 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants