Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3 connector (for data detection & discovery) POC - fides #4930

Merged
merged 18 commits into from
Jun 14, 2024

Conversation

adamsachs
Copy link
Contributor

@adamsachs adamsachs commented May 29, 2024

Closes PROD-2151

‼️ includes a DB migration, bump downrev as needed!

Needed for https://github.com/ethyca/fidesplus/pull/1456. See loom there for a demo of the connector functionality that's been added here..

Description Of Changes

OSS-side changes to support s3 as (initially) a d&d data source. Eventually we'll want to support S3 as a DSR data source, but that's not supported for now.

This PR creates an "S3" connection type that provides a space for AWS authentication credentials to be stored. We implement a new (and improved!) AWS auth paradigm that can leverage an optional "assume role ARN" to assume a specific role for S3 communications. This paradigm is not yet leveraged for our other AWS connections (e.g. DynamoDB), but the underlying utility can easily be leveraged, and we should look to support this paradigm for those connection types soon.

The S3 connection type implements a basic connection test to list buckets (as that's a key operation for our d&d monitor), but no support for the methods that need to be implemented for DSR processing.

Code Changes

  • add S3Connector, ConnectionType.s3 enum, and S3Schema for secrets/credentials. add logo for the FE
  • refactor, rename and improve an aws_util..py that can be used generically to get an AWS session with provided credentials that optionally assumes a provided role ARN

Steps to Confirm

  • connection type can be selected in the UI and tested successfully - with and without a provided assume role ARN
  • further manual testing done while integrating with the associated fidesplus PR

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation:
    • documentation complete, PR opened in fidesdocs
    • documentation issue created in fidesdocs
    • if there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
  • Issue Requirements are Met
  • Relevant Follow-Up Issues Created
  • Update CHANGELOG.md
  • For API changes, the Postman collection has been updated
  • If there are any database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!

Copy link

vercel bot commented May 29, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
fides-plus-nightly ⬜️ Ignored (Inspect) Visit Preview Jun 14, 2024 6:28pm

Copy link

cypress bot commented May 31, 2024

Passing run #8335 ↗︎

0 4 0 0 Flakiness 0
⚠️ You've recorded test results over your free plan limit.
Upgrade your plan to view test results.

Details:

Merge 1c76a85 into 3b92387...
Project: fides Commit: 969fd7c591 ℹ️
Status: Passed Duration: 00:34 💡
Started: Jun 14, 2024 6:39 PM Ended: Jun 14, 2024 6:40 PM

Review all test suite changes for PR #4930 ↗︎

@adamsachs adamsachs force-pushed the asachs/PROD-2098-s3-3d-poc branch from 4c2154f to 4db3d15 Compare June 7, 2024 21:58
@adamsachs adamsachs changed the title DRAFT: s3 3d (data detection & discovery) POC - fides s3 connector (for data detection & discovery) POC - fides Jun 12, 2024
@adamsachs adamsachs self-assigned this Jun 12, 2024
Copy link
Contributor Author

@adamsachs adamsachs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some self-review

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed name of storage_authenticator.py to make this more generic. also updated some functionality too (see below)

raise ValueError(f"Auth method not supported for S3: {auth_method}")

if assume_role_arn:
if sts_client is None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wasn't sure if it was safe to call the

sts_client = session.client("sts")
sts_client.get_caller_identity()

block for all use cases with the "automatic" auth method, since we hadn't been doing that before. so rather than risk a regression, i'm only creating the sts client when secret keys are provided (as it had been done before), OR unless it's explicitly needed to assume role...the code is a bit harder to follow/less DRY, but i'd rather be safe for now...

from fides.api.schemas.storage.storage import AWSAuthMethod, StorageSecrets


def get_aws_session(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more appropriately/generically named function now. the function here can and should be used by some of our other AWS connectors/codepaths, since this seems like a pretty solid authentication paradigm to promote across AWS resource types.

@@ -25,6 +25,8 @@ class DynamoDBSchema(ConnectionConfigSecretsSchema):
sensitive=True,
)

# TODO: include an aws_assume_role_arn and more closely follow the pattern in `connection_secrets_s3`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't want to make our changes too wide-ranging quite yet, so i've just put in a note that we should look to switch dynamoDB over to this authentication paradigm soon. dynamoDB does require a region (I think?), so i'm not sure how that changes things. but I still think that provided an option to assume a role will be the right thing to support here moving forward 👍

@adamsachs adamsachs marked this pull request as ready for review June 12, 2024 14:30
Copy link

codecov bot commented Jun 12, 2024

Codecov Report

Attention: Patch coverage is 70.96774% with 27 lines in your changes missing coverage. Please review.

Project coverage is 86.19%. Comparing base (3b92387) to head (1c76a85).
Report is 1 commits behind head on main.

Files Patch % Lines
src/fides/api/service/connectors/s3_connector.py 63.33% 11 Missing ⚠️
src/fides/api/util/aws_util.py 67.74% 10 Missing ⚠️
.../connection_configuration/connection_secrets_s3.py 83.33% 3 Missing ⚠️
src/fides/api/task/task_resources.py 33.33% 2 Missing ⚠️
src/fides/api/tasks/storage.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4930      +/-   ##
==========================================
- Coverage   86.27%   86.19%   -0.08%     
==========================================
  Files         349      351       +2     
  Lines       21628    21693      +65     
  Branches     2869     2872       +3     
==========================================
+ Hits        18659    18699      +40     
- Misses       2470     2495      +25     
  Partials      499      499              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +20 to +33
aws_access_key_id: Optional[str] = Field(
title="Access Key ID",
description="Part of the credentials that provide access to your AWS account. This is required if using secret key authentication.",
)
aws_secret_access_key: Optional[str] = Field(
title="Secret Access Key",
description="Part of the credentials that provide access to your AWS account. This is required if using secret key authentication.",
sensitive=True,
)

aws_assume_role_arn: Optional[str] = Field(
title="Assume Role ARN",
description="If provided, the ARN of the role that should be assumed to connect to s3.",
)
Copy link
Contributor

@daveqnet daveqnet Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema looks good! From my perspective:

  • Access Key ID: not particularly sensitive, a bit like a username
  • Secret Access Key: sensitive, treat it like a password
  • Assume Role ARN: not particularly sensitive, again a bit like a username


from fides.api.common_exceptions import StorageUploadError
from fides.api.schemas.storage.storage import AWSAuthMethod, StorageSecrets

Copy link
Contributor

@daveqnet daveqnet Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general comment on the utility here rather than any specific line of code: great idea to make a generic utility than can be reused, AWS auth won't change.

As I understand it, the utility gives three authn options:

  1. Use an IAM user's long-term access key to authenticate for the entire session.
  2. Rely on boto3's automatic method of checking the local environment for different ways AWS credentials can be provided by the env, and authenticate using them for the session. There are a few of these, ranging from env vars to using the EC2 instance metadata service, all documented by AWS here.
  3. Use an IAM user's long-term access key to get temporary security credentials (a.k.a. assume an IAM role) and use these temporary creds for the session.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's just about spot on! the only slight difference is that i've also allowed the AWS credentials derived by boto3 (i.e. option #2) to be used in step #3, i.e. to assume an IAM role (if a user has specified automatic authentication method while also providing an assume role ARN).

is that even a valid use case? i was a bit unsure whether this is something that can (or should) be supported. if not, i can easily clean up the code and make this a bit more straightforward.

@adamsachs adamsachs requested a review from pattisdr June 12, 2024 21:52
@adamsachs
Copy link
Contributor Author

adamsachs commented Jun 12, 2024

@pattisdr i wanted to get this OSS-side PR on your queue for review, no big rush though. i'm still working through the final touches on the plus-side to get it ready for review, but i think this can be reviewed in isolation and don't want to leave it totally held up by the plus one!

@pattisdr
Copy link
Contributor

Starting review - apologies for the merge conflicts 😓

Copy link
Contributor

@pattisdr pattisdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me @adamsachs nice round of changes

@adamsachs
Copy link
Contributor Author

test is failing in main and is unrelated - will go ahead and merge 👍

@adamsachs adamsachs merged commit a82c7c4 into main Jun 14, 2024
46 of 48 checks passed
@adamsachs adamsachs deleted the asachs/PROD-2098-s3-3d-poc branch June 14, 2024 21:11
Copy link

cypress bot commented Jun 14, 2024

Passing run #8343 ↗︎

0 4 0 0 Flakiness 0
⚠️ You've recorded test results over your free plan limit.
Upgrade your plan to view test results.

Details:

s3 connector (for data detection & discovery) POC - fides (#4930)
Project: fides Commit: a82c7c446a
Status: Passed Duration: 00:36 💡
Started: Jun 14, 2024 9:22 PM Ended: Jun 14, 2024 9:23 PM

Review all test suite changes for PR #4930 ↗︎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants