Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Handle Pre-filtering of tables #811

Merged
merged 1 commit into from
Oct 18, 2023

Conversation

anushka-singh
Copy link
Contributor

@anushka-singh anushka-singh commented Oct 16, 2023

Feature or Bugfix

  • Feature

Detail

  • For a dataset to make sense all the tables within a dataset should have their location pointing to the same place as the dataset S3 bucket. However it is possible that a database can have tables which do not point to the same bucket which is perfectly legal in LakeFormation. Therefore we propose that data.all automatically only lists tables that have the same S3 bucket location as the dataset. This will solve a problem for Yahoo where we want to import a database that contains many tables with different buckets. Additionally Catalog UI should also only list prefiltered tables.

Testing

  • Tested this in local env. I was able to create and share datasets even after pre-filtering process takes place.
  • Will send separate PR for unit testing.

Relates:

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use a standard proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor

@noah-paige noah-paige left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested the following in AWS Deployment

Pre-existing dataset with 2 tables:

  • Dataset Synchronize successfully and loads 2 tables

Pre-existing dataset with 0 table:

  • Dataset Synchronize successfully and loads 0 tables

Import new dataset (2 tables in 2 different buckets)

  • Synchronize only loads 1 table
  • Share Request for dataset with tables filtered and can only add the 1 table
  • Create worksheet and can only query the 1 table

ECS Table Sync Task

  • Successfully Runs and finds appropriate number of tables for each dataset

I think the code changes look good to me and UI working as expected! - approving to merge to v2m1m0

@noah-paige noah-paige merged commit c833c26 into data-dot-all:v2m1m0 Oct 18, 2023
@dlpzx dlpzx added this to the v2.1.0 milestone Oct 30, 2023
@dlpzx dlpzx modified the milestone: v2.1.0 Oct 30, 2023
@dlpzx dlpzx mentioned this pull request Oct 30, 2023
dlpzx added a commit that referenced this pull request Nov 8, 2023
### Feature or Bugfix
- Feature
- Bugfix
- Refactoring

### Detail

#### Features
* Limit pivot role S3 permissions by @dlpzx in
#780
* Limit pivot role KMS permissions by @dlpzx in
#830
* Add configurable session timeout to IDP by @manjulaK in
#786
* Allow to submit a share when you are both an approver and a requester
by @zsaltys in #793
* Redirect upon creating a share request by @zsaltys in
#799
* Handle Pre-filtering of tables by @anushka-singh in
#811
* Email Notification on Share Workflow - Issue - 734 by @TejasRGitHub in
#818
* Refactor notifications from core to modules by @dlpzx in
#822
* Add frontend and backend feature flags by @zsaltys in
#817
* Make hosted_zone_id optional by @lorchda in
#812

#### Fixes
* Add Additional Error Messages for KMS Key lookup on imported dataset
by @noah-paige in #748
* Handle Environment Import of IAM service roles by @noah-paige in
#749
* Build Compliant Names for Opensearch Resources by @noah-paige in
#750
* Update Lambda runtime by @nikpodsh in
#782
* Ensure valid environments for share request and other objects creation
by @dlpzx in #781
* Fix shell true semgrep by @dlpzx in
#760
* Add condition when there are no public subnets by @lorchda in
#794
* Remove unused variable by @zsaltys in
#815
* Check other share exists before clean up by @noah-paige in
#769


### Relates
- v2.1.0 minor release

## New Contributors
* @manjulaK made their first contribution in
#786
* @zsaltys made their first contribution in
#793
* @anushka-singh made their first contribution in
#811
* @TejasRGitHub made their first contribution in
#818

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jaidisido <jaidisido@gmail.com>
Co-authored-by: mourya-33 <134511711+mourya-33@users.noreply.github.com>
Co-authored-by: nikpodsh <124577300+nikpodsh@users.noreply.github.com>
Co-authored-by: MK <manjula_kasturi@hotmail.com>
Co-authored-by: Zilvinas Saltys <zilvinas.saltys@yahooinc.com>
Co-authored-by: Daniel Lorch <98748454+lorchda@users.noreply.github.com>
Co-authored-by: Anushka Singh <anushka.singh@yahooinc.com>
Co-authored-by: trajopadhye <tejas.rajopadhye@yahooinc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhancing Dataset representation to consider mutiple buckets & filtered list of tables
3 participants