
Better error message for s3fs-mapped files that are in glacier (C4-747) #23

Open · wants to merge 11 commits into master
Conversation

@netsettler (Contributor) commented Feb 25, 2022

This addresses the vague awscli error in SubmitCGAP (C4-747) by producing a specific error message for s3fs-mapped files. To do that, it requires that the environment variables CGAP_S3FS_UPLOAD_DIR and CGAP_S3FS_UPLOAD_BUCKETS be set.

@sbreiff, it would be best if we could get the setup script for s3fs into SubmitCGAP as well, so that it can arrange to set those variables. From discussion on the ticket, with env vars adjusted:

#!/bin/bash
# Expects $CGAP_S3FS_UPLOAD_DIR to name a mount directory (e.g., ~/upload_files)
# and $CGAP_S3FS_UPLOAD_BUCKETS to be set to a string containing one or more
# bucket names, separated by whitespace or line breaks.

# Install s3fs-fuse for mounting S3 buckets
sudo amazon-linux-extras install epel -y
sudo yum install s3fs-fuse -y

# Mount buckets at the $CGAP_S3FS_UPLOAD_DIR directory
mkdir -p "$CGAP_S3FS_UPLOAD_DIR"
for BUCKET in $CGAP_S3FS_UPLOAD_BUCKETS  # unquoted on purpose: word-splits the bucket list
do
    s3fs "$BUCKET" "$CGAP_S3FS_UPLOAD_DIR" -o iam_role
done

# Create a virtual env for package installation
python3 -m venv ~/cgap_submission
source ~/cgap_submission/bin/activate

# Install SubmitCGAP so its commands can be run against the mounted files
pip install submit_cgap

The env vars this PR uses are slightly different from the script's, so that would have to be adjusted, too. The invocation looks like:

export CGAP_S3FS_UPLOAD_BUCKETS=elasticbeanstalk-fourfront-cgap-wfoutput
export CGAP_S3FS_UPLOAD_DIR=~/upload_files/
bash upload_files.sh
source ~/cgap_submission/bin/activate
resume-uploads 76298911-78b8-4a97-8704-37f5ce391e58 -u $CGAP_S3FS_UPLOAD_DIR -s http://fourfront-cgaptest.9wzadzju3p.us-east-1.elasticbeanstalk.com

The resulting interaction looks like:

Upload 1 file? [yes/no]: yes
Uploading /home/ec2-user/upload_files/6502af34-4313-4295-bdff-12991d8fcd46/GAPFI4SUPQO9.cram to item 82e5354e-cc3f-4bf5-a0d7-3948c32df0c2 ...
Going to upload /home/ec2-user/upload_files/6502af34-4313-4295-bdff-12991d8fcd46/GAPFI4SUPQO9.cram to s3://elasticbeanstalk-fourfront-cgaptest-wfoutput/82e5354e-cc3f-4bf5-a0d7-3948c32df0c2/GAPFI1HFPW19.cram.
upload failed: upload_files/6502af34-4313-4295-bdff-12991d8fcd46/GAPFI4SUPQO9.cram to s3://elasticbeanstalk-fourfront-cgaptest-wfoutput/82e5354e-cc3f-4bf5-a0d7-3948c32df0c2/GAPFI1HFPW19.cram [Errno 5] Input/output error
The file /home/ec2-user/upload_files/6502af34-4313-4295-bdff-12991d8fcd46/GAPFI4SUPQO9.cram is mapped via S3FS to DEEP_ARCHIVE storage.
RuntimeError: Upload failed with exit code 1
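
For context, here is a rough sketch of the detection flow this PR implements, assembled from the diff hunks quoted in the review comments below (bash_enumeration, show, and AVAILABLE_S3_STORAGE_CLASSES are the PR's own names; the bucket-splitting and the storage-class list here are simplifications):

import os
import re

import boto3
from botocore.exceptions import ClientError

# Simplified stand-in for the PR's AVAILABLE_S3_STORAGE_CLASSES: classes from
# which an object can be downloaded without first being restored.
AVAILABLE_S3_STORAGE_CLASSES = ['STANDARD', 'STANDARD_IA', 'ONEZONE_IA',
                                'INTELLIGENT_TIERING', 'REDUCED_REDUNDANCY']

def explain_s3fs_upload_failure(filename, show=print):
    """If filename lives under the s3fs mount dir, HEAD the backing object in
    each candidate bucket and report its storage class when it is glaciated."""
    upload_buckets = os.environ.get("CGAP_S3FS_UPLOAD_BUCKETS")
    upload_dir = os.environ.get("CGAP_S3FS_UPLOAD_DIR")
    if not (upload_buckets and upload_dir):
        return  # the diagnostic only runs when both variables are set
    mapped_dir = upload_dir.rstrip('/')
    pattern = f"^(?:{re.escape(mapped_dir)}|{re.escape(os.path.expanduser(mapped_dir))})/(.*)$"
    m = re.match(pattern, filename)
    if not m:
        return  # the file is not under the mounted directory
    mapped_key = m.group(1)
    s3 = boto3.client('s3')
    for mapped_bucket in upload_buckets.split():  # simplification of bash_enumeration
        try:
            metadata = s3.head_object(Bucket=mapped_bucket, Key=mapped_key)
        except ClientError:
            continue  # an error means this bucket doesn't hold the key
        storage_class = metadata.get('StorageClass', 'STANDARD')
        if storage_class not in AVAILABLE_S3_STORAGE_CLASSES:
            show(f"The file {filename} is mapped via S3FS to {storage_class} storage.")
        return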

Comment on lines +13 to +14
* Better error diagnostics for S3FS-mounted files that are glaciated
if the ``CGAP_S3FS_UPLOAD_BUCKETS`` and ``CGAP_S3FS_UPLOAD_DIR`` environment variables are set.
Collaborator:

Script should be updated to set these values.

metadata = s3.head_object(Bucket=mapped_bucket, Key=mapped_key)
storage_class = metadata.get('StorageClass', 'STANDARD')  # head_object omits StorageClass for STANDARD objects
if storage_class not in AVAILABLE_S3_STORAGE_CLASSES:
    show(f"The file {filename} is mapped via S3FS to {storage_class} storage.")
Collaborator:

You might consider expanding this to include more information. I'm not sure we should assume this statement alone is enough to convey the problem (and the solution). I would just say something like:

show(f"The file {filename} is mapped via S3FS to {storage_class} storage."
      " Use awscli or equivalent to restore this file to a storage class that it can be retrieved from: {AVAILABLE_S3_STORAGE_CLASSES}")

@netsettler (Contributor, Author) commented:

Before leaving, Sarah left a review comment on PR #25, which I'm going to close (because the part of it that was not this PR is already merged to master):

At this point maybe some documentation on how to set the CGAP_S3FS_UPLOAD_BUCKETS and CGAP_S3FS_UPLOAD_DIR ENVs correctly is needed, because the error handling isn't working as intended for me during some brief testing so far.

@drio18 (Collaborator) left a comment:

If we're moving forward with this PR, I think we should move the s3fs-related documentation here from the portal and update the portal's documentation to point here. A couple of changes may be required to make the process work on Windows machines and with encrypted accounts.

Comment on lines +818 to +820
mapped_dir = upload_dir.rstrip('/')
pattern = f"^(?:{re.escape(mapped_dir)}|{re.escape(os.path.expanduser(mapped_dir))})/(.*)$"
m = re.match(pattern, filename)
Collaborator:

Will this work for Windows-style paths? It'd be good to add some to the tests here.
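
For illustration, such a test might look like this (a sketch; _mapped_key is a hypothetical stand-in for the PR's matching logic, not an actual helper in the diff):

import os
import re

def _mapped_key(upload_dir, filename):
    # Hypothetical wrapper around the matching logic from the hunk above.
    mapped_dir = upload_dir.rstrip('/')
    pattern = f"^(?:{re.escape(mapped_dir)}|{re.escape(os.path.expanduser(mapped_dir))})/(.*)$"
    m = re.match(pattern, filename)
    return m.group(1) if m else None

def test_mapped_key_windows_style():
    # Backslash-separated paths never match the forward-slash pattern, so on
    # Windows the match silently fails; pathlib-based normalization would be needed.
    assert _mapped_key(r"C:\upload_files", r"C:\upload_files\GAPFI4SUPQO9.cram") is None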

candidates = bash_enumeration(upload_buckets)
for mapped_bucket in candidates:
    try:
        s3.head_object(Bucket=mapped_bucket, Key=mapped_key)  # an error means we're failing
Collaborator:

Will this call require S3 encryption kwargs?
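
For what it's worth, HEAD requests against SSE-S3 and SSE-KMS objects generally need no extra kwargs (the caller just needs the usual read permissions); only SSE-C objects require the key material to be passed explicitly, along these lines (a sketch; customer_key is hypothetical):

s3.head_object(
    Bucket=mapped_bucket,
    Key=mapped_key,
    SSECustomerAlgorithm='AES256',  # required only for SSE-C objects
    SSECustomerKey=customer_key,    # hypothetical: the raw customer-provided key
)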

Comment on lines +813 to +814
upload_buckets = os.environ.get("CGAP_S3FS_UPLOAD_BUCKETS")
upload_dir = os.environ.get("CGAP_S3FS_UPLOAD_DIR")
Collaborator:

Perhaps we should change these to be more generic since the process here is not specific to s3fs but rather to files mounted from S3. For example, one could use goofys to mount S3 files instead with a similar script, and the same process of checking the files' storage class would apply.
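
For example (hypothetical names, purely to illustrate the suggestion; the PR currently reads CGAP_S3FS_UPLOAD_*):

import os

# Mount-agnostic variable names would cover s3fs, goofys, or any other S3 mount.
upload_buckets = os.environ.get("CGAP_S3_MOUNT_UPLOAD_BUCKETS")
upload_dir = os.environ.get("CGAP_S3_MOUNT_UPLOAD_DIR")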
