Resolve Dataset Profiling Glue Job #649

Merged 5 commits on Aug 10, 2023

Conversation

noah-paige
Contributor

Feature or Bugfix

  • Bugfix

Detail

  • Specify SPARK_VERSION as an environment variable for pydeequ before import
  • Add IAM Permissions to Dataset IAM Role to Allow for Glue Job logging in CloudWatch
  • Add LF Permissions to resolve insufficient permissions error thrown when looking for default database
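The first item above depends on ordering: the variable has to be in the environment before `pydeequ` is imported. A minimal sketch of that ordering follows, assuming a helper name (`ensure_spark_version`) and a default value (`"3.1"`) that are illustrative only and not taken from this PR; the correct value depends on the Spark version of the Glue runtime in use.

```python
import os


def ensure_spark_version(default: str = "3.1") -> str:
    """Set SPARK_VERSION if it is not already set and return the effective value.

    The helper name and the default value are assumptions for illustration.
    """
    # setdefault keeps any value already provided by the job environment.
    os.environ.setdefault("SPARK_VERSION", default)
    return os.environ["SPARK_VERSION"]


# The environment variable must be in place before the import below runs.
ensure_spark_version()

try:
    import pydeequ  # noqa: F401  (imported only after SPARK_VERSION is set)
except ImportError:
    # pydeequ is not installed where this sketch is being run.
    pydeequ = None
```

Using `setdefault` rather than a plain assignment means a value supplied through the job configuration takes precedence over the hard-coded fallback.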

Relates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@noah-paige
Contributor Author

A note for this PR: Glue Jobs require permissions to verify that the default database exists, or to create it if it does not (AWS Docs).

This applies to Glue Jobs that use the Data Catalog as the Hive metastore, which our Glue Profiling Job does.

To accommodate this requirement, the PR adds code to the dataset Custom Resource Lambda that grants DESCRIBE LF Permissions on the default database to the Dataset Principals (Dataset Role, EnvGroup Role, PivotRole).

As of now, even with the changes in this PR, the Glue Profiling Job will still fail if the default database does not exist...
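As a rough illustration of the grant described above, the sketch below calls the Lake Formation `grant_permissions` API once per principal. The function name, the example ARNs, and the choice to pass the client in as a parameter are assumptions made for this sketch and are not the code added by the PR.

```python
from typing import Any, Iterable, List


def grant_describe_on_default_db(
    lf_client: Any,
    principal_arns: Iterable[str],
    database_name: str = "default",
) -> List[str]:
    """Grant DESCRIBE on one database to each principal; return the ARNs processed.

    lf_client is expected to behave like boto3.client("lakeformation").
    """
    granted = []
    for arn in principal_arns:
        lf_client.grant_permissions(
            Principal={"DataLakePrincipalIdentifier": arn},
            Resource={"Database": {"Name": database_name}},
            Permissions=["DESCRIBE"],
        )
        granted.append(arn)
    return granted


# Hypothetical usage (the ARNs are placeholders):
#   import boto3
#   lf = boto3.client("lakeformation")
#   grant_describe_on_default_db(lf, [
#       "arn:aws:iam::111122223333:role/example-dataset-role",
#       "arn:aws:iam::111122223333:role/example-env-group-role",
#   ])
```

Passing the client in keeps the function easy to exercise with a stub. Only DESCRIBE is granted, which matches the database-level permission discussed in this thread.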

@@ -49,6 +49,13 @@ def on_create(event):
    except ClientError as e:
        pass

    default_db_exists = False
    try:
What happens if the default does not exist?

@dlpzx
Contributor

dlpzx commented Aug 10, 2023

So now, every time we create a dataset, the dataset admin role will get permissions to the default database. Permissions to this database are needed to run the profiling job. Do you know why? Is it a general requirement that any Glue Job needs permissions to the default database?

@dlpzx
Contributor

dlpzx commented Aug 10, 2023

We also need to fix it in modularization-main, but we need to be careful with all the PRs that are currently open

Contributor

@dlpzx dlpzx left a comment

And thank you, this is a long-needed fix :)

@nikpodsh
Contributor

@dlpzx, @noah-paige I will gladly replicate these changes to mod-main :)

@noah-paige
Contributor Author

> So now, every time we create a dataset, the dataset admin role will get permissions to the default database. Permissions to this database are needed to run the profiling job. Do you know why? Is it a general requirement that any Glue Job needs permissions to the default database?

As far as I know, the documentation calls out that Glue tries to verify the default DB only if you enable "Use Data Catalog as the Hive metastore".

The only permission required is DESCRIBE at the DB level, since all Glue does is verify that the default DB exists by calling the GetDatabase API.
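The existence check described above can be sketched as a single `get_database` call that treats "not found" as `False`. The function name is an assumption made for this sketch; `glue_client` is expected to behave like `boto3.client("glue")`, which exposes `EntityNotFoundException` under `client.exceptions`.

```python
from typing import Any


def default_database_exists(glue_client: Any, name: str = "default") -> bool:
    """Return True if the named Glue database exists, False if it is not found.

    Any other error (for example, insufficient permissions) is left to
    propagate so that it is not mistaken for a missing database.
    """
    try:
        glue_client.get_database(Name=name)
        return True
    except glue_client.exceptions.EntityNotFoundException:
        return False


# Hypothetical usage:
#   import boto3
#   glue = boto3.client("glue")
#   if not default_database_exists(glue):
#       print("default database is missing; the profiling job would still fail")
```

Catching only the not-found case keeps a permissions failure visible, which is the situation this PR addresses with the DESCRIBE grant.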

@noah-paige noah-paige merged commit a39fd43 into main Aug 10, 2023
nikpodsh added a commit that referenced this pull request Aug 16, 2023
Merge latest changes from main into modularization-main

It includes changes from #626, #630, #648, #649, and #651

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
Co-authored-by: wolanlu <101870655+wolanlu@users.noreply.github.com>
Co-authored-by: Amr Saber <amr.m.saber.mail@gmail.com>
Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com>
Co-authored-by: kukushking <kukushkin.anton@gmail.com>
Co-authored-by: Dariusz Osiennik <osiend@amazon.com>
Co-authored-by: Dennis Goldner <107395339+degoldner@users.noreply.github.com>
Co-authored-by: Abdulrahman Kaitoua <abdulrahman.kaitoua@polimi.it>
Co-authored-by: akaitoua-sa <126820454+akaitoua-sa@users.noreply.github.com>
Co-authored-by: Gezim Musliaj <102723839+gmuslia@users.noreply.github.com>
Co-authored-by: Rick Bernotas <97474536+rbernotas@users.noreply.github.com>
Co-authored-by: David Mutune Kimengu <57294718+kimengu-david@users.noreply.github.com>
Co-authored-by: chamcca <40579012+chamcca@users.noreply.github.com>
Co-authored-by: Dhruba <117375130+marjet26@users.noreply.github.com>
Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
Co-authored-by: Srinivas Reddy <srinivasreddych@outlook.com>
Co-authored-by: mourya-33 <134511711+mourya-33@users.noreply.github.com>
Co-authored-by: Noah Paige <noahpaig@amazon.com>
Co-authored-by: dlpzx <dlpzx@amazon.com>
@dlpzx dlpzx deleted the bugfix/dataset-profile-job branch November 8, 2023 08:38