Add more visibility on resource-link databases imported as datasets #1098

Open
dlpzx opened this issue Mar 11, 2024 · 1 comment

Comments

@dlpzx
Contributor

dlpzx commented Mar 11, 2024

This is an enhancement request. #1021 allows users to import a Glue database that does not originate in the same AWS account as the S3 bucket. This scenario is very similar to the one described in this blogpost: data producer accounts store the data in S3, and a central catalog account holds all the Glue databases. The Glue databases are then shared back with the data producer accounts as resource link databases using Lake Formation. More schematically:

In AWS:

  • Account A - Central Catalog - Original Glue database + data lake location registered in Lake Formation
  • Account B - Data producer account - S3 Bucket + Resource link database

In data.all:

  • Environment A
  • Environment B - Imported Dataset with S3 Bucket + Resource link database + ANOTHER registration in Lake Formation + Glue Crawler + IAM role that can access the bucket and the resource link database

Data sharing detects the source catalog and shares the Original Glue database, provided the prerequisites are met: Environment A is onboarded in data.all and the Original Glue database is tagged as explained in #1021.
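For illustration, detecting the source catalog could look roughly like the sketch below. It is not the data.all implementation. It assumes the input dict has the shape of the `Database` entry returned by the boto3 Glue `get_database` call, where a resource link carries a `TargetDatabase` with `CatalogId` and `DatabaseName`. The function name, the returned keys, and the account IDs are made up for the example.

```python
def resolve_source_database(database: dict) -> dict:
    """Return the catalog and database that actually hold the tables.

    `database` is assumed to look like the "Database" entry of a Glue
    get_database response. The helper name and return shape are hypothetical.
    """
    target = database.get("TargetDatabase")
    if target:
        # Resource link: follow the pointer to the original database.
        return {
            "catalog_id": target["CatalogId"],
            "database_name": target["DatabaseName"],
            "is_resource_link": True,
        }
    # Regular database: it is its own source.
    return {
        "catalog_id": database.get("CatalogId"),
        "database_name": database["Name"],
        "is_resource_link": False,
    }


# Placeholder account IDs: Account A = central catalog, Account B = producer.
resource_link = {
    "Name": "sales_link",
    "CatalogId": "222222222222",
    "TargetDatabase": {"CatalogId": "111111111111", "DatabaseName": "sales"},
}
source = resolve_source_database(resource_link)
print(source["catalog_id"], source["database_name"], source["is_resource_link"])
```

In a real deployment the dict would come from something like `boto3.client("glue").get_database(Name=...)["Database"]`, called with credentials for the producer account.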

Issues:

  • The second registration in Lake Formation is not needed and pollutes LF
  • The Glue crawler in the producer account targets the resource link database, which does not make much sense. Instead, if anything, it should create tables in the Original database as explained in this blogpost.
  • No "heads up" in the UI indicating that the prerequisites are needed
  • No visibility in the UI on whether the prerequisites are fulfilled

Solutions

(we can implement more than one of these, or other alternatives)

  • Add documentation in user guide - planned as part of 2.3 release
  • Store as Dataset metadata if a database is a resource link database:
    • Show it in the UI, together with info that the Original database must be tagged and the catalog environment must be onboarded
    • Avoid creating Glue crawler and registering the data lake location in LF for resource link databases
    • Potentially simplify sharing checks
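
As a rough sketch of the metadata idea above: a flag on the dataset could drive both the UI hint and which resources get created. Everything here is hypothetical (the `ImportedDataset` class, the `is_resource_link` field, the resource names, and the notice text); it does not reflect the actual data.all models or stacks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImportedDataset:
    # Hypothetical dataset metadata; field names are assumptions.
    bucket_name: str
    glue_database_name: str
    is_resource_link: bool = False


def resources_to_provision(dataset: ImportedDataset) -> List[str]:
    """List the resources to create for an imported dataset."""
    resources = ["iam_role"]
    if not dataset.is_resource_link:
        # Only regular databases get the LF registration and the crawler.
        resources += ["lakeformation_location_registration", "glue_crawler"]
    return resources


def ui_notice(dataset: ImportedDataset) -> Optional[str]:
    """Return a heads-up message for the UI, or None if nothing is needed."""
    if dataset.is_resource_link:
        return (
            "This dataset uses a resource link database. Sharing requires the "
            "catalog environment to be onboarded and the original Glue "
            "database to be tagged (see #1021)."
        )
    return None


dataset = ImportedDataset("producer-bucket", "sales_link", is_resource_link=True)
print(resources_to_provision(dataset))
print(ui_notice(dataset))
```

Keeping the decision in one flag would also give the sharing checks a single place to branch on, which is where the potential simplification comes from.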
@SofiaSazonova
Contributor

As a relatively new user of data.all, I would love to see some more instructions directly in the UI. Maybe they shouldn't be shown by default, but it would be nice to have a (?) icon that links to the relevant paragraph in the user guide.

As for the LF locations, I think it's some kind of bug (feature?): we should register the location afterwards. I think we need to put some effort into researching this behaviour.

@dlpzx dlpzx added this to v2.7.0 Jun 17, 2024
@dlpzx dlpzx moved this to To do in v2.7.0 Jun 17, 2024
@anmolsgandhi anmolsgandhi moved this from Prioritized To do to Nominated in v2.7.0 Jun 18, 2024
@dlpzx dlpzx added this to v2.8.0 Sep 9, 2024
@github-project-automation github-project-automation bot moved this to Nominated in v2.8.0 Sep 9, 2024
@NickCorbett NickCorbett removed this from v2.7.0 Oct 10, 2024
Projects
Status: Nominated