Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent file overwrite by manage-study (SCP-2698) #109

Merged
merged 8 commits into from
Oct 7, 2020
Merged

Conversation

jlchang
Copy link
Collaborator

@jlchang jlchang commented Oct 5, 2020

Currently manage-study uploads files before performing ingest without checking if a file of the same name already exists in the study bucket. manage-study should not overwrite existing, valid study files.

With this update, manage-study handles file upload more appropriately with the follow behaviors:

  1. checks that no file with the upload file name exists in the google bucket
  2. checks that file paths for upload from google buckets are vaild
  3. removes file copied for upload if upload request is refused (422 status code)
  4. if upload request for file pre-existing in study bucket, does not perform cleanup ("/" and "-" not allowed in CLUSTER names? #3 above)
  5. handles 422 server response appropriately (either 3 or 4 above)

This fulfills SCP-2698.

@codecov
Copy link

codecov bot commented Oct 5, 2020

Codecov Report

Merging #109 into master will increase coverage by 0.65%.
The diff coverage is 44.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #109      +/-   ##
==========================================
+ Coverage   22.36%   23.01%   +0.65%     
==========================================
  Files          17       17              
  Lines        2835     2876      +41     
==========================================
+ Hits          634      662      +28     
- Misses       2201     2214      +13     
Impacted Files Coverage Δ
scripts/manage_study.py 22.00% <ø> (ø)
scripts/Commandline.py 21.62% <7.69%> (-1.24%) ⬇️
scripts/scp_api.py 49.65% <55.55%> (+2.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f1a3e1e...59b1b69. Read the comment docs.

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice robustness improvement :). I suggested a non-blocking refinement.

82 exit-failed-to-gsutil-delete-file
83 exit-uploaded-file-deleted
84 exit-no-file-cleanup-needed
85 exit-file-not-found-in-remote-bucket
Copy link
Member

@eweitz eweitz Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't consider it blocking, but now or later we should refactor to use Python errors instead of Bash errors.

As is, upon encountering a state we consider an error, this approach does:

  • print(specific_error_message)
  • exit(specific_error_code)

That complicates future downstream error handling in Python, e.g. logging the error to Sentry or Mixpanel. Instead, raising a Python error (like we do in Ingest Pipeline here) would ease writing to terminal, log file, and external services like Sentry or Mixpanel. If custom exit codes are indeed needed, that can be done in Python error handling as shown here.

The current approach is fine for now (it doesn't cause any functional issues), but if you prefer to refactor later then please open a tech debt ticket and note this as a TODO in the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added SCP-2790 to backlog for future refactoring

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, but I'm a little confused about the dual implementation of exists_in_bucket

@@ -99,6 +99,12 @@
cmdline = Commandline.Commandline()


def exists_in_bucket(bucket_file_path, mute=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this method implemented twice, but once as a @staticmethod? If they have different functions (one uses gsutil stat and another uses gsutil ls), then I don't understand why they would have the same method name.

Copy link
Collaborator Author

@jlchang jlchang Oct 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops! Eno helped me refactor so my code to make exists_in_bucket a staticmethod in the Class - didn't realize I had neglected to delete the original... deleted in 03633e1

@jlchang jlchang merged commit 9446a55 into master Oct 7, 2020
@jlchang jlchang deleted the jlc_avoid_clobber branch October 20, 2020 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants