feat: Handle Pre-filtering of tables #811

anushka-singh · 2023-10-16T19:54:25Z

Feature or Bugfix

Feature

Detail

For a dataset to make sense, all tables within it should have their location pointing to the dataset's S3 bucket. However, a database can contain tables that point to other buckets, which is perfectly legal in Lake Formation. We therefore propose that data.all automatically list only the tables whose S3 location is in the same bucket as the dataset. This solves a problem for Yahoo, where we want to import a database that contains many tables in different buckets. Additionally, the Catalog UI should list only the prefiltered tables.
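The pre-filtering idea can be sketched as follows. This is a minimal illustration and not the code in this PR: the function names `table_bucket` and `prefilter_tables` are hypothetical, and the table dicts are assumed to follow the shape returned by the AWS Glue `get_tables` API (`Name`, `StorageDescriptor.Location`). Tables without a location (for example views) are assumed to be excluded.

```python
from urllib.parse import urlparse

# Assumption: these URI schemes all refer to S3 locations.
S3_SCHEMES = {"s3", "s3a", "s3n"}


def table_bucket(table):
    """Return the S3 bucket name of a Glue-style table dict, or None."""
    location = (table.get("StorageDescriptor") or {}).get("Location")
    if not location:
        return None
    parsed = urlparse(location)
    if parsed.scheme not in S3_SCHEMES:
        return None
    # For "s3://bucket/prefix", the bucket is the network-location part.
    return parsed.netloc or None


def prefilter_tables(tables, dataset_bucket):
    """Keep only tables whose location is in the dataset's bucket."""
    return [t for t in tables if table_bucket(t) == dataset_bucket]


if __name__ == "__main__":
    # Hypothetical sample data: two tables in two different buckets.
    tables = [
        {"Name": "orders", "StorageDescriptor": {"Location": "s3://dataset-bucket/orders/"}},
        {"Name": "clicks", "StorageDescriptor": {"Location": "s3://other-bucket/clicks/"}},
    ]
    print([t["Name"] for t in prefilter_tables(tables, "dataset-bucket")])  # ['orders']
```

The sketch compares parsed bucket names exactly rather than using a string prefix check, so a table in `dataset-bucket-2` is not mistaken for one in `dataset-bucket`.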

Testing

Tested this in a local environment. I was able to create and share datasets after the pre-filtering process took place.
Will send a separate PR for unit tests.

Relates:

on import prefilter database tables based on table S3 location #745

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

Does this PR introduce or modify any input fields or queries - this includes
fetching data from storage outside the application (e.g. a database, an S3 bucket)?
- Is the input sanitized?
- What precautions are you taking before deserializing the data you consume?
- Is injection prevented by parametrizing queries?
- Have you ensured no eval or similar functions are used?
Does this PR introduce any functionality or component that requires authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
- Are you logging failed auth attempts?
Are you using or adding any cryptographic features?
- Do you use standard, proven implementations?
- Are the used keys controlled by the customer? Where are they stored?
Are you introducing any new policies/roles/users?
- Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

noah-paige

Tested the following in AWS Deployment

Pre-existing dataset with 2 tables:

Dataset synchronizes successfully and loads 2 tables

Pre-existing dataset with 0 tables:

Dataset synchronizes successfully and loads 0 tables

Import new dataset (2 tables in 2 different buckets)

Synchronize only loads 1 table
Share request for the dataset shows the filtered tables, and only the 1 table can be added
Created a worksheet and can only query the 1 table

ECS Table Sync Task

Runs successfully and finds the appropriate number of tables for each dataset

The code changes look good to me and the UI is working as expected! Approving to merge to v2m1m0.

@dlpzx

### Feature or Bugfix
- Feature
- Bugfix
- Refactoring

### Detail
#### Features
* Limit pivot role S3 permissions by @dlpzx in #780
* Limit pivot role KMS permissions by @dlpzx in #830
* Add configurable session timeout to IDP by @manjulaK in #786
* Allow to submit a share when you are both an approver and a requester by @zsaltys in #793
* Redirect upon creating a share request by @zsaltys in #799
* Handle Pre-filtering of tables by @anushka-singh in #811
* Email Notification on Share Workflow - Issue - 734 by @TejasRGitHub in #818
* Refactor notifications from core to modules by @dlpzx in #822
* Add frontend and backend feature flags by @zsaltys in #817
* Make hosted_zone_id optional by @lorchda in #812

#### Fixes
* Add Additional Error Messages for KMS Key lookup on imported dataset by @noah-paige in #748
* Handle Environment Import of IAM service roles by @noah-paige in #749
* Build Compliant Names for Opensearch Resources by @noah-paige in #750
* Update Lambda runtime by @nikpodsh in #782
* Ensure valid environments for share request and other objects creation by @dlpzx in #781
* Fix shell true semgrep by @dlpzx in #760
* Add condition when there are no public subnets by @lorchda in #794
* Remove unused variable by @zsaltys in #815
* Check other share exists before clean up by @noah-paige in #769

### Relates
- v2.1.0 minor release

## New Contributors
* @manjulaK made their first contribution in #786
* @zsaltys made their first contribution in #793
* @anushka-singh made their first contribution in #811
* @TejasRGitHub made their first contribution in #818

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jaidisido <jaidisido@gmail.com>
Co-authored-by: mourya-33 <134511711+mourya-33@users.noreply.github.com>
Co-authored-by: nikpodsh <124577300+nikpodsh@users.noreply.github.com>
Co-authored-by: MK <manjula_kasturi@hotmail.com>
Co-authored-by: Zilvinas Saltys <zilvinas.saltys@yahooinc.com>
Co-authored-by: Daniel Lorch <98748454+lorchda@users.noreply.github.com>
Co-authored-by: Anushka Singh <anushka.singh@yahooinc.com>
Co-authored-by: trajopadhye <tejas.rajopadhye@yahooinc.com>

feat: Handle Pre-filtering of tables

7bbb302

noah-paige approved these changes Oct 17, 2023

noah-paige merged commit c833c26 into data-dot-all:v2m1m0 Oct 18, 2023

dlpzx added this to the v2.1.0 milestone Oct 30, 2023

dlpzx linked an issue Oct 30, 2023 that may be closed by this pull request

Enhancing Dataset representation to consider mutiple buckets & filtered list of tables #720

Closed

dlpzx modified the milestone: v2.1.0 Oct 30, 2023

dlpzx mentioned this pull request Oct 30, 2023

v2.1.0 features #840

Merged
