Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3? #305

Open
ajstanley opened this issue Jul 22, 2021 · 14 comments
Open

S3? #305

ajstanley opened this issue Jul 22, 2021 · 14 comments
Labels
enhancement New feature or request

Comments

@ajstanley
Copy link
Contributor

We should be able to pull S3 files down (with proper creds). I'm thinking we could parse the contents of a cell's contents in the file column, and if it was in the format of (value, value, value) we could feed it to the input_dir as s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME').

I'd be happy to take this one on if it seems like a good idea to anyone else

@mjordan mjordan added the enhancement New feature or request label Jul 22, 2021
@mjordan
Copy link
Owner

mjordan commented Jul 22, 2021

Sure, sounds good. It might be good to wait to open a PR until I've merged in #287 (I'm still working on the add_media task), which should be next week (I've put aside some vacation time to finish this one).

Would the ability to fetch files from s3 require a new library?

@ajstanley
Copy link
Contributor Author

Yep - one new library. I'll wait til 287 is merged - it should be pretty strightforward.

@g7morris
Copy link

g7morris commented Jan 27, 2022

Hi @ajstanley & @mjordan

Oooo this would be very cool (and essential to cloud-based workflows) to build as a feature. Do either of you have any plans to pursue this now that the PR above was merged in?

Would be happy to help test and document if that helps.

Thanks in advance!

@ajstanley
Copy link
Contributor Author

@g7morris I kind of abandoned this when I found I could get everything from the S3 bucket via a regular URL, but I can have another go if the interest is there.

@g7morris
Copy link

@ajstanley If that is the case, perhaps it is more of just documentation then? Perhaps if folks know that they can call a S3 URL from the tool or within the csv that is enough?

@g7morris
Copy link

g7morris commented Jan 27, 2022

Either way to call from the S3 bucket from within the tool or as notated in a csv via the S3 bucket URL would be most appreciated. Thank you both

@ajstanley
Copy link
Contributor Author

I'd kinda like to see various downloads as plugins - local, https, s3 ...

@g7morris
Copy link

g7morris commented Jan 27, 2022

That sounds good too @ajstanley but in terms of using an S3 URL is it as easy as putting in the S3 bucket URL and then the object / file name in the associated csv for Workbench? I assume one might need to make the S3 publicly read only?

Ideally one could do all of this privately from an Ec2 server using aws cli and calling the s3: directly? This would make ingest fly comparatively.

@mjordan
Copy link
Owner

mjordan commented Jan 28, 2022

I'd kinda like to see various downloads as plugins - local, https, s3 .

If people would find native s3 support useful, let's add it. I originally steered away from plugins in Workbench to keep things simple, albeit "opinionated". But, I am now starting to think that enough use cases and real-world usage has emerged to justify considering them. Addressing some bugs and optimizations has prevented me from sustaining dev time on items 1 and 2 on the Fall 2021 Roadmap, and those features are really needed, so I'd rather stick to those two items. But @ajstanley if you want to sketch out how file-source plugins would work, I'd be open to discussing. One question I have before committing though is, beyond native s3 support, what other file sources would people want? If we can't think of any more, maybe we don't need the extra flexibility and complexity of plugins. Could be convinced either way.

@ajstanley
Copy link
Contributor Author

@g7morris - I have limited experience with s3 - On the one project where I used them extensively we were able to get regular https: addresses for each of the objects we needed. I think the advantage of using the s3 library is that we could pass credentials to get restricted material.

@mjordan Instead of actual plugins if we created a file class with methods for the various kinds of input, we could easily sniff out the logic to direct to the right method - ie if s3 in filename: file.getS3(filename). Not a plugin as such, but some neatly organized code with instance variables to cover commonalities.

@mjordan
Copy link
Owner

mjordan commented Jan 28, 2022

@ajstanley yeah, this is roughly how workbench_fields.py works now. I like this approach, since it lets us define a simple interface/logic in the main workbench and workbench_utils files but is fairly easy to extend.

Also makes it easier to write tests!

@dara2
Copy link

dara2 commented Dec 14, 2022

Hi @mjordan and @ajstanley - We are having some issues because of Workbench needing S3 buckets to be public. Is this on the roadmap - to allow for Workbench to pull files from a private S3 bucket?

@mjordan
Copy link
Owner

mjordan commented Dec 14, 2022

I'm afraid I have zero experience with S3, and will be fully occupied with work on #373 and #292 for the next while.

Rereading the conversation above, I'm not as sanguine about introducing a refactor of the file-fetching code as I was back in January, mainly because I've fallen behind on making sure integration tests continue to work (a bunch are broken right now). I wouldn't want to introduce such a large change without being able to run those tests. Perhaps over the holidays I can spend time outside of #373 and #292 on catching up on making the tests work, which would put us in good shape for introducing a large change like a new file-handling class early in the new year. Workbench is now so large and complex that we really need to be writing and running more integration tests.

@dara2
Copy link

dara2 commented Dec 15, 2022

Okay, that all makes sense. Thank you for the update!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants