PXP-10361 Allow specifying the upload bucket #1051

Merged (12 commits, Oct 5, 2022)
fence/__init__.py (13 additions & 2 deletions)

@@ -239,7 +239,7 @@ def _check_azure_storage(app):
        logger.debug(err)


def _check_aws_creds_and_region(app):
def _check_buckets_aws_creds_and_region(app):
    """
    Function to ensure that all s3_buckets have a valid credential.
    Additionally, if there is no region it will produce a warning
@@ -248,6 +248,7 @@ def _check_aws_creds_and_region(app):
    buckets = config.get("S3_BUCKETS") or {}
    aws_creds = config.get("AWS_CREDENTIALS") or {}

    # check that AWS creds and regions are configured
    for bucket_name, bucket_details in buckets.items():
        cred = bucket_details.get("cred")
        region = bucket_details.get("region")
@@ -302,6 +303,16 @@ def _check_aws_creds_and_region(app):
            )
        )

    # check that all the configured buckets are in `S3_BUCKETS`
    bucket_names = config["ALLOWED_DATA_UPLOAD_BUCKETS"] or []
    if config["DATA_UPLOAD_BUCKET"]:
        bucket_names.append(config["DATA_UPLOAD_BUCKET"])
    for bucket_name in bucket_names:
        if bucket_name not in buckets:
            logger.warning(
                f"Data upload bucket '{bucket_name}' is not configured in 'S3_BUCKETS'"
            )


def app_config(
    app,
@@ -358,7 +369,7 @@ def app_config(
    _setup_oidc_clients(app)

    with app.app_context():
        _check_aws_creds_and_region(app)
        _check_buckets_aws_creds_and_region(app)
        _check_azure_storage(app)

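For illustration, a minimal sketch (not part of the PR) of the kind of configuration mismatch the new startup check flags; all bucket names and credential keys here are hypothetical:

```python
# Hypothetical fence config: "bucket2" is allowed for upload but missing
# from S3_BUCKETS, so the check above would log a warning for it at startup.
config = {
    "S3_BUCKETS": {
        "bucket1": {"cred": "CRED1", "region": "us-east-1"},
    },
    "ALLOWED_DATA_UPLOAD_BUCKETS": ["bucket1", "bucket2"],  # "bucket2" triggers the warning
    "DATA_UPLOAD_BUCKET": "bucket1",
}
```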
fence/blueprints/data/blueprint.py (33 additions & 4 deletions)
@@ -1,6 +1,7 @@
import flask

from cdislogging import get_logger
from cdispyutils.config import get_value

from fence.auth import login_required, require_auth_header, current_token, get_jwt
from fence.authz.auth import check_arborist_auth
@@ -9,8 +10,7 @@
    IndexedFile,
    get_signed_url_for_file,
)
from fence.config import config
from fence.errors import Forbidden, InternalError, UserError, Forbidden
from fence.errors import Forbidden, InternalError, UserError, Unauthorized
from fence.resources.audit.utils import enable_audit_logging
from fence.utils import get_valid_expiration

@@ -175,11 +175,24 @@ def upload_data_file():
    )

    protocol = params["protocol"] if "protocol" in params else None
    bucket = params.get("bucket")

Contributor: I don't think file_upload is enough of an authorization check here. We don't want to allow upload to any bucket, especially existing buckets we may not own (but might have more permissive access to, like write). Just because all devs have file upload doesn't mean that someone should be able to create a signed URL to upload to a bucket we don't actually own and are only supposed to be reading data from. Sure, the IAM on the bucket should disallow that, but we can't count on security outside our control.

I would suggest we consider having a separate configuration for data upload buckets, ALLOWED_DATA_UPLOAD_BUCKETS or something, that we check against first.

We may want to also consider only allowing this new feature when providing authz, as that will be required for passports going forward (so we'll want to deprecate acl even more than we are today).

Contributor: And this should error if the provided bucket isn't in the configured list of allowed buckets. This also protects us from malicious user input, without the need to do actual sanitization.

Contributor Author: That sounds fair.

Contributor Author (@paulineribeyre, Oct 4, 2022), replying to "We may want to also consider only allowing this new feature when providing authz": I'm not sure that's a good idea, because the "data upload" flow does not use authz and is still supported. Plus, some people in the community have written their own code to mimic that flow (such as when they're not using S3 buckets), and I think we should allow them to use this feature. Wdyt?

Contributor: That's fair. We'll still have the restriction of requiring authz for passports, but I think that's fine, and people will just have to adjust upload flows if they want to use that.

    if bucket:
        s3_buckets = get_value(
            flask.current_app.config,
            "ALLOWED_DATA_UPLOAD_BUCKETS",
            InternalError("ALLOWED_DATA_UPLOAD_BUCKETS not configured"),
        )
        if bucket not in s3_buckets:
            logger.debug(f"Bucket '{bucket}' not in ALLOWED_DATA_UPLOAD_BUCKETS config")
            raise Forbidden(f"Uploading to bucket '{bucket}' is not allowed")

    response = {
        "guid": blank_index.guid,
        "url": blank_index.make_signed_url(
            file_name=params["file_name"], protocol=protocol, expires_in=expires_in
            file_name=params["file_name"],
            protocol=protocol,
            expires_in=expires_in,
            bucket=bucket,
        ),
    }

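For context, a hedged sketch of how a client might call this endpoint once the change is deployed; the commons URL, token value, and bucket name are hypothetical:

```python
import requests

FENCE_URL = "https://example-commons.org/user"  # hypothetical commons
token = "..."  # access token with file_upload permission

resp = requests.post(
    f"{FENCE_URL}/data/upload",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "file_name": "my_file.bam",
        "bucket": "bucket1",  # must be listed in ALLOWED_DATA_UPLOAD_BUCKETS
    },
)
resp.raise_for_status()
data = resp.json()
print(data["guid"], data["url"])  # new GUID and presigned upload URL
```

If the requested bucket is not in ALLOWED_DATA_UPLOAD_BUCKETS, the request fails with the 403 Forbidden raised above.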
@@ -193,6 +206,9 @@ def upload_data_file():
def init_multipart_upload():
    """
    Initialize a multipart upload request

    NOTE This endpoint does not currently accept a `bucket` parameter like
    `POST /upload` and `GET /upload/<GUID>` do.
    """
    params = flask.request.get_json()
    if not params:
@@ -293,7 +309,20 @@ def upload_file(file_id):
        file_name = str(file_id).replace("/", "_")
        logger.warning(f"file_name not provided, using '{file_name}'")

    result = get_signed_url_for_file("upload", file_id, file_name=file_name)
    bucket = flask.request.args.get("bucket")
    if bucket:
        s3_buckets = get_value(
            flask.current_app.config,
            "ALLOWED_DATA_UPLOAD_BUCKETS",
            InternalError("ALLOWED_DATA_UPLOAD_BUCKETS not configured"),
        )
        if bucket not in s3_buckets:
            logger.debug(f"Bucket '{bucket}' not in ALLOWED_DATA_UPLOAD_BUCKETS config")
            raise Forbidden(f"Uploading to bucket '{bucket}' is not allowed")

    result = get_signed_url_for_file(
        "upload", file_id, file_name=file_name, bucket=bucket
    )
    return flask.jsonify(result)
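And a similar hedged sketch for this endpoint, requesting an upload URL for an existing GUID and targeting a specific bucket via the new query parameter (the GUID is invented, and FENCE_URL and token are reused from the sketch above):

```python
import requests

guid = "dg.1234/11111111-2222-3333-4444-555555555555"  # hypothetical GUID
resp = requests.get(
    f"{FENCE_URL}/data/upload/{guid}",
    headers={"Authorization": f"Bearer {token}"},
    params={"file_name": "my_file.bam", "bucket": "bucket1"},
)
resp.raise_for_status()
print(resp.json()["url"])  # presigned URL pointing at the requested bucket
```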


fence/blueprints/data/indexd.py (25 additions & 13 deletions)
@@ -73,6 +73,7 @@ def get_signed_url_for_file(
    requested_protocol=None,
    ga4gh_passports=None,
    db_session=None,
    bucket=None,
):
    requested_protocol = requested_protocol or flask.request.args.get("protocol", None)
    r_pays_project = flask.request.args.get("userProject", None)
@@ -144,6 +145,7 @@
        r_pays_project=r_pays_project,
        file_name=file_name,
        users_from_passports=users_from_passports,
        bucket=bucket,
    )

    # a single user from the list was authorized so update the audit log to reflect that
@@ -258,7 +260,7 @@ def index_document(self):
        )
        return document

    def make_signed_url(self, file_name, protocol=None, expires_in=None):
    def make_signed_url(self, file_name, protocol=None, expires_in=None, bucket=None):
        """
        Works for upload only; S3 or Azure Blob Storage only
        (only supported case for data upload flow currently).
@@ -286,12 +288,15 @@ def make_signed_url(self, file_name, protocol=None, expires_in=None):
                "upload", expires_in
            )
        else:
            try:
                bucket = flask.current_app.config["DATA_UPLOAD_BUCKET"]
            except KeyError:
                raise InternalError(
                    "fence not configured with data upload bucket; can't create signed URL"
                )
            if not bucket:
                try:
                    bucket = flask.current_app.config["DATA_UPLOAD_BUCKET"]
                except KeyError:
                    raise InternalError(
                        "fence not configured with data upload bucket; can't create signed URL"
                    )

            self.logger.debug("Attempting to upload to bucket '{}'".format(bucket))
            s3_url = "s3://{}/{}/{}".format(bucket, self.guid, file_name)
            url = S3IndexedFileLocation(s3_url).get_signed_url("upload", expires_in)

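To make the fallback concrete, a small hypothetical illustration of the object key this method builds (all values invented):

```python
bucket = "bucket1"  # explicit `bucket` argument, else DATA_UPLOAD_BUCKET
guid = "dg.1234/11111111-2222-3333-4444-555555555555"
file_name = "my_file.bam"
s3_url = "s3://{}/{}/{}".format(bucket, guid, file_name)
# -> "s3://bucket1/dg.1234/11111111-2222-3333-4444-555555555555/my_file.bam"
```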
@@ -450,6 +455,7 @@ def get_signed_url(
        r_pays_project=None,
        file_name=None,
        users_from_passports=None,
        bucket=None,
    ):
        users_from_passports = users_from_passports or {}
        authorized_user = None
@@ -497,6 +503,7 @@
                    r_pays_project,
                    file_name,
                    authorized_user,
                    bucket,
                ),
                authorized_user,
            )
@@ -510,14 +517,18 @@ def _get_signed_url(
        r_pays_project,
        file_name,
        authorized_user=None,
        bucket=None,
    ):
        if action == "upload":
            # NOTE: self.index_document ensures the GUID exists in indexd and raises
            # an error if not (which is expected to be caught upstream in the
            # app)
            blank_record = BlankIndex(uploader="", guid=self.index_document.get("did"))
            return blank_record.make_signed_url(
                protocol=protocol, file_name=file_name, expires_in=expires_in
                protocol=protocol,
                file_name=file_name,
                expires_in=expires_in,
                bucket=bucket,
            )

        if not protocol:
@@ -876,7 +887,7 @@ def bucket_name(self):
        s3_buckets = get_value(
            flask.current_app.config,
            "S3_BUCKETS",
            InternalError("buckets not configured"),
            InternalError("S3_BUCKETS not configured"),
        )
        for bucket in s3_buckets:
            if re.match("^" + bucket + "$", self.parsed_url.netloc):
@@ -892,7 +903,7 @@ def get_credential_to_access_bucket(
        cls, bucket_name, aws_creds, expires_in, boto=None
    ):
        s3_buckets = get_value(
            config, "S3_BUCKETS", InternalError("buckets not configured")
            config, "S3_BUCKETS", InternalError("S3_BUCKETS not configured")
        )
        if len(aws_creds) == 0 and len(s3_buckets) == 0:
            raise InternalError("no bucket is configured")
@@ -901,7 +912,8 @@

        bucket_cred = s3_buckets.get(bucket_name)
        if bucket_cred is None:
            raise Unauthorized("permission denied for bucket")
            logger.debug(f"Bucket '{bucket_name}' not found in S3_BUCKETS config")

Contributor: We'll never get here for data upload buckets, right? Because now we're erroring on startup. I think we should try to keep the old behavior and defer the error, because a lot of commons don't actually use the data upload feature but might have a bucket configured (and even though fence defaults it to '', I think cloud-automation might automatically add one).

Contributor Author: I'll change this to a warning log.

Contributor Author: Although I think it might be better to check the config at startup in general, and then assume the config is correct; but that's a discussion for later :-)

            raise InternalError("permission denied for bucket")

        cred_key = get_value(
            bucket_cred, "cred", InternalError("credential of that bucket is missing")
@@ -930,7 +942,7 @@ def get_credential_to_access_bucket(

    def get_bucket_region(self):
        s3_buckets = get_value(
            config, "S3_BUCKETS", InternalError("buckets not configured")
            config, "S3_BUCKETS", InternalError("S3_BUCKETS not configured")
        )
        if len(s3_buckets) == 0:
            return None
@@ -957,7 +969,7 @@ def get_signed_url(
            config, "AWS_CREDENTIALS", InternalError("credentials not configured")
        )
        s3_buckets = get_value(
            config, "S3_BUCKETS", InternalError("buckets not configured")
            config, "S3_BUCKETS", InternalError("S3_BUCKETS not configured")
        )

        bucket_name = self.bucket_name()
fence/config-default.yaml (7 additions & 4 deletions)

@@ -633,10 +633,13 @@ S3_BUCKETS: {}
# region: 'us-east-1'
# role-arn: 'arn:aws:iam::role1'

# `DATA_UPLOAD_BUCKET` specifies an S3 bucket to which data files are uploaded,
# using the `/data/upload` endpoint. This must be one of the first keys under
# `S3_BUCKETS` (since these are the buckets fence has credentials for).
DATA_UPLOAD_BUCKET: 'bucket1'
# Names of the S3 buckets to which data files can be uploaded. They should be
# configured in `S3_BUCKETS`.
ALLOWED_DATA_UPLOAD_BUCKETS: []

# Default S3 bucket to which data files are uploaded, when the bucket is not specified
# by the uploader. It should be configured in `S3_BUCKETS`.
DATA_UPLOAD_BUCKET: ''

# //////////////////////////////////////////////////////////////////////////////////////
# PROXY
fence/resources/audit/client.py (1 addition & 1 deletion)

@@ -38,7 +38,7 @@ def __init__(self, service_url, logger):

        if self.push_type == "aws_sqs":
            aws_sqs_config = config["PUSH_AUDIT_LOGS_CONFIG"]["aws_sqs_config"]
            # we know the cred is in AWS_CREDENTIALS (see `_check_aws_creds_and_region`)
            # we know the cred is in AWS_CREDENTIALS (see `_check_buckets_aws_creds_and_region`)
            aws_creds = (
                config.get("AWS_CREDENTIALS", {})[aws_sqs_config["aws_cred"]]
                if "aws_cred" in aws_sqs_config
openapis/swagger.yaml (38 additions & 2 deletions)

@@ -633,7 +633,15 @@ paths:
          in: query
          description: >-
            the requested file name in the cloud bucket you will upload to.
            if not provided, will use the GUID/file_id
            If not provided, will use the GUID/file_id
          schema:
            type: string
        - name: bucket
          required: false
          in: query
          description: >-
            the requested bucket to upload to.
            If not provided, defaults to the configured DATA_UPLOAD_BUCKET.
          schema:
            type: string
      responses:
@@ -668,7 +676,7 @@
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/RequestUploadBlank'
              $ref: '#/components/schemas/InitMultipartUpload'
      responses:
        201:
          description: successful operation; created new blank record in indexd and uploadId for multipart upload
@@ -1649,6 +1657,34 @@
            user: generic user access
  schemas:
    RequestUploadBlank:
      type: object
      required:
        - file_name
      properties:
        file_name:
          type: string
          description: the file name to use for this upload
        bucket:
          type: string
          description: >-
            the requested bucket to upload to.
            If not provided, defaults to the configured DATA_UPLOAD_BUCKET.
        expires_in:
          type: integer
          description: optional integer specifying the presigned URL lifetime
        authz:
          type: array
          items:
            type: string
          description: >-
            requested authorization resources to be set on the resulting
            indexed record. You must have proper authorization to set this
      example:
        file_name: "my_file.bam"
        bucket: "bucket1"
        expires_in: 1200
        authz: ["/programs/A"]
    InitMultipartUpload:
      type: object
      required:
      properties: