Add more visibility on resource-link databases imported as datasets #1098

dlpzx · 2024-03-11T15:56:40Z

This is an enhancement request. #1021 allows users to import a Glue database that does not originally live in the same AWS account as the S3 Bucket. This scenario is very similar to the one described in this blog post: data is stored in S3 in data producer accounts, and all Glue databases are created in a central catalog account. The Glue databases are then shared back with the data producer accounts as resource link databases using Lake Formation. More schematically:

In AWS:

Account A - Central Catalog - Original Glue database + data lake location registered in Lake Formation
Account B - Data producer account - S3 Bucket + Resource link database

In data.all:

Environment A
Environment B - Imported Dataset with S3 Bucket + Resource link database + a second registration in Lake Formation + Glue Crawler + IAM role that can access the Bucket and the resource link database

Data sharing detects the source catalog and shares the Original Glue database, provided the prerequisites are met: Environment A is onboarded in data.all and the Original Glue database is tagged as explained in #1021.

Issues:

The second registration in Lake Formation is not needed and pollutes Lake Formation.
The Glue crawler in the producer account targets the resource link database, which does not make much sense. If anything, it should create tables in the Original database, as explained in this blog post.
The UI gives no "heads up" that the prerequisites are needed.
The UI gives no visibility on whether the prerequisites are fulfilled.
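
To make the last two points concrete, here is a minimal sketch of a prerequisite check whose result the UI could display. Everything in it is hypothetical: `PrerequisiteStatus`, `check_prerequisites` and their parameters are illustration names, not existing data.all code, and the tag key required by #1021 is passed in as a parameter rather than assumed.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PrerequisiteStatus:
    """Hypothetical result object that the UI could render per dataset."""
    catalog_environment_onboarded: bool
    source_database_tagged: bool

    @property
    def fulfilled(self) -> bool:
        # Sharing only works when both prerequisites hold.
        return self.catalog_environment_onboarded and self.source_database_tagged


def check_prerequisites(
    catalog_account_id: str,
    onboarded_account_ids: Iterable[str],
    source_database_tags: Dict[str, str],
    required_tag_key: str,
) -> PrerequisiteStatus:
    """Evaluate both prerequisites for a resource link dataset.

    catalog_account_id: AWS account that owns the Original Glue database.
    onboarded_account_ids: accounts that have an environment in data.all.
    source_database_tags: tags found on the Original Glue database.
    required_tag_key: the tag key described in #1021 (not assumed here).
    """
    return PrerequisiteStatus(
        catalog_environment_onboarded=catalog_account_id in set(onboarded_account_ids),
        source_database_tagged=required_tag_key in source_database_tags,
    )
```

Returning the two flags separately, instead of a single boolean, would let the UI tell the user which prerequisite is missing.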

Solutions

(we can implement more than one of these, or other alternatives)

Add documentation in the user guide - planned as part of the 2.3 release
Store as Dataset metadata whether a database is a resource link database:
- Show it in the UI, together with info that the Original database must be tagged and the environment onboarded
- Avoid creating the Glue crawler and registering the data lake location in LF for resource link databases
- Potentially simplify sharing checks
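
A rough sketch of how the second option could look. It relies on the fact that the Glue `GetDatabase` response contains a `TargetDatabase` block (with `CatalogId` and `DatabaseName`) only for resource link databases. The names `ResourceLinkInfo`, `describe_glue_database`, `should_create_crawler` and `should_register_lf_location` are hypothetical and do not come from the data.all code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceLinkInfo:
    """Hypothetical metadata a Dataset could store about its Glue database."""
    is_resource_link: bool
    source_catalog_id: Optional[str] = None
    source_database_name: Optional[str] = None


def describe_glue_database(get_database_response: dict) -> ResourceLinkInfo:
    """Inspect a Glue GetDatabase response and detect a resource link.

    Resource link databases carry a TargetDatabase block that points at the
    Original database; regular databases do not.
    """
    database = get_database_response.get("Database", {})
    target = database.get("TargetDatabase")
    if not target:
        return ResourceLinkInfo(is_resource_link=False)
    return ResourceLinkInfo(
        is_resource_link=True,
        source_catalog_id=target.get("CatalogId"),
        source_database_name=target.get("DatabaseName"),
    )


def should_create_crawler(info: ResourceLinkInfo) -> bool:
    # Assumption from this issue: a crawler targeting a resource link is not useful.
    return not info.is_resource_link


def should_register_lf_location(info: ResourceLinkInfo) -> bool:
    # Assumption from this issue: the location is already registered in the
    # central catalog account, so a second registration is skipped.
    return not info.is_resource_link
```

With boto3, the input would come from something like `boto3.client("glue").get_database(Name=database_name)`. The stored `source_catalog_id` could also feed the sharing checks, since it identifies the account that needs to be onboarded.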

SofiaSazonova · 2024-03-12T14:10:55Z

As a relatively new user of data.all, I would love to see more instructions directly in the UI. Maybe they shouldn't be shown by default, but it would be nice to have a (?) icon that links to the relevant paragraph in the user guide.

As for the LF locations, I think it's some kind of bug (feature?): we should register the location afterwards. I think we need to put effort into researching this behaviour.

dlpzx added type: enhancement Feature enhacement priority: medium effort: medium labels Mar 22, 2024

dlpzx added this to v2.7.0 Jun 17, 2024

dlpzx moved this to To do in v2.7.0 Jun 17, 2024

anmolsgandhi moved this from Prioritized To do to Nominated in v2.7.0 Jun 18, 2024

dlpzx added the priority: low label Jul 12, 2024

dlpzx added this to v2.8.0 Sep 9, 2024

github-project-automation bot moved this to Nominated in v2.8.0 Sep 9, 2024

NickCorbett removed this from v2.7.0 Oct 10, 2024
