Authenticated inputs #2

Closed
delagoya opened this issue Sep 15, 2009 · 11 comments

Comments

@delagoya
Contributor

Right now CloudCrowd::Action assumes that the input resource is either a local file or accessible via a simple unauthenticated GET request. There is a good case for needing to grab files from private buckets.

Stop-gap is to provide pre-authenticated URLs for the input, but it would be better to "do it right" especially since the save() call authenticates with S3 credentials...

@jashkenas
Member

I'm not so sure that we want to assume that your Application and your CloudCrowd installation share S3 credentials. Your application might want to serve authenticated URLs directly, it might want to point (potentially) anywhere on the web... Pre-authenticated URLs seem like the way to go -- and you can pass them through an HTTPS request to CloudCrowd for real security. Each side, the application server and CloudCrowd, is responsible for controlling access to its own content, and delivering accessible URLs to the other.

However, if there's a really clean way to make direct S3 happen, I'm all for it.
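
For illustration, here is a minimal sketch of how the application side might produce a pre-authenticated URL, using S3's legacy query-string authentication (Signature Version 2), the scheme in use around the time of this thread. The helper name `presigned_s3_url`, the bucket, the key, and the credentials are all placeholders rather than anything from CloudCrowd, and newer S3 regions require Signature Version 4, so treat this as a sketch of the idea and not production code.

```ruby
require "openssl"
require "uri"

# Hypothetical helper (not part of CloudCrowd): builds a time-limited GET URL
# for a private S3 object with legacy query-string auth (Signature Version 2).
# Assumes the object key contains only URL-safe characters.
def presigned_s3_url(bucket, key, access_key_id, secret_access_key, expires_at)
  expires = expires_at.to_i
  # String to sign: HTTP verb, Content-MD5, Content-Type, expiry, resource.
  string_to_sign = "GET\n\n\n#{expires}\n/#{bucket}/#{key}"
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new("sha1"), secret_access_key, string_to_sign)
  # pack("m0") is Base64 without line breaks; the result is then URL-encoded.
  signature = URI.encode_www_form_component([digest].pack("m0"))
  "https://#{bucket}.s3.amazonaws.com/#{key}" \
    "?AWSAccessKeyId=#{URI.encode_www_form_component(access_key_id)}" \
    "&Expires=#{expires}&Signature=#{signature}"
end

# Placeholder values; the resulting URL is what would be passed to CloudCrowd as an input.
puts presigned_s3_url("example-bucket", "inputs/report.pdf",
                      "AKIDEXAMPLE", "not-a-real-secret", Time.now + 3600)
```

With this approach CloudCrowd never sees the application's S3 credentials, only a URL that stops working after the expiry time.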

@delagoya
Contributor Author

Good points. I thought of using a config option (:use_asset_store_for_input = true) and/or an "s3://bucket/file" URL that would look in the S3 AssetStore for the input files, but I think neither of these is a stellar option.

Closing the issue, as I think you are right that this stretches the bounds of the application.

@jashkenas
Member

Ok, but an "s3://" protocol sounds pretty cool. It would probably be against all specs and sensibility to add a top-level protocol like that, but semantically it makes sense: The file:// protocol only works when you're on the same filesystem, and breaks otherwise. The s3:// protocol could only work if you're sharing access credentials, and break otherwise. Something to consider adding.
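
A rough sketch of what recognizing such URLs could look like, using only Ruby's standard `uri` library. The helper name `parse_input` and the shape of the returned hash are assumptions made for illustration; nothing here is existing CloudCrowd code.

```ruby
require "uri"

# Hypothetical helper: classify an input URL by its scheme.
# An s3:// URL is read as s3://<bucket>/<key>.
def parse_input(url)
  uri = URI.parse(url)
  case uri.scheme
  when "s3"
    # Only usable when the worker shares S3 credentials with the application.
    { source: :s3, bucket: uri.host, key: uri.path.sub(%r{\A/}, "") }
  when "file", nil
    # Only usable when the worker shares a filesystem with the application.
    { source: :local, path: uri.path }
  when "http", "https"
    { source: :http, url: url }
  else
    raise ArgumentError, "unsupported input URL: #{url}"
  end
end

p parse_input("s3://example-bucket/inputs/report.pdf")
```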

@delagoya
Contributor Author

OK, I'll take a stab at implementing it tomorrow.

@jashkenas
Member

Here's some precedent for the notion:

http://p.eligrey.com/

@delagoya
Contributor Author

Do you want to abstract this out to any type of data store? E.g. have the protocol be

store://

and the AssetStore classes implement a get() method:

get(url, local_path) 
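
To make the proposal concrete, here is one way a filesystem-backed store could implement that method. The class name `FilesystemStoreSketch`, its constructor, and the mapping from a store:// URL to a path under a root directory are all assumptions for this sketch, not the actual AssetStore code.

```ruby
require "fileutils"

# Hypothetical filesystem-backed store: everything after "store://" is
# treated as a path relative to the store's root directory.
class FilesystemStoreSketch
  PREFIX = "store://"

  def initialize(root)
    @root = root
  end

  # Copies the stored file named by url to local_path and returns local_path.
  def get(url, local_path)
    raise ArgumentError, "not a store:// URL: #{url}" unless url.start_with?(PREFIX)
    source = File.join(@root, url[PREFIX.length..-1])
    FileUtils.mkdir_p(File.dirname(local_path))
    FileUtils.cp(source, local_path)
    local_path
  end
end
```

An S3-backed store would expose the same `get(url, local_path)` signature and download the object instead of copying it, so an action would not need to know which store is configured.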

@jashkenas
Member

That looks really, really nice. It would be totally seamless, and actions could return "store://" URLs as intermediate results for further processing. We need to think about what would happen when you call save(). Should it return an authenticated URL, or a store:// URL? How do you tell?

The other thing to think about is the protocol prefix. "store://" is nice and short, but I'm not sure I'd know what it meant if I wasn't familiar with it (it might look like a shortcut to Amazon). Maybe we should do "cloudcrowd://", if you need an AssetStore implementation to handle it, or maybe we should just YAGNI and go with "s3://" until we have another backend that needs custom protocols. I'm torn.

(Sorry it took a little while to post this, with GitHub down).

@delagoya
Contributor Author

Also torn.

Question: would you still use file:// as an input URL even when the AssetStore is FileSystem? E.g. is it ever the case that you want to pull from the worker's local file system as well as push files to S3? If the answer is "yes" then let's YAGNI and just use s3://

Last question, should an exception be thrown (or mark the job as failed) if no S3 credentials are supplied in the config when a worker sees s3 inputs?

@jashkenas
Member

I think that the answer is yes if you're using some sort of distributed filesystem backend (like a shared EBS under NFS). That seems like it would be a popular option, being arguably faster than S3. In that case we'd need to make LOCAL_STORAGE_PATH configurable (you know what -- I'll just add that in a minute), and the existing FilesystemStore would do the trick. So, let's go with s3://regular/public/url...

For the last question, I'd throw a custom exception (add it to exceptions.rb), something like S3NotConfigured, which will in turn mark the work unit (and the job) as failed, all by itself.
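
A sketch of that idea follows. The module name `CloudCrowdSketch`, the base `Error` class, the `ensure_s3_configured!` helper, and the config key names are assumptions made for this example; only the exception name S3NotConfigured comes from the suggestion above.

```ruby
# Hypothetical namespace standing in for the project's exceptions.rb.
module CloudCrowdSketch
  class Error < RuntimeError; end

  # Raised when a worker is handed an s3:// input without S3 credentials.
  class S3NotConfigured < Error; end

  # Assumed config keys; the real configuration may use different names.
  REQUIRED_KEYS = [:aws_access_key, :aws_secret_key]

  # Raises S3NotConfigured if input_url is an s3:// URL and any credential is blank.
  def self.ensure_s3_configured!(config, input_url)
    return unless input_url.start_with?("s3://")
    missing = REQUIRED_KEYS.select { |key| config[key].to_s.empty? }
    return if missing.empty?
    raise S3NotConfigured, "cannot fetch #{input_url}: missing #{missing.join(', ')}"
  end
end
```

Because the exception is raised before any download is attempted, whatever already marks a work unit as failed on an unhandled error would cover this case without extra handling.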

@delagoya
Contributor Author

OK, the repo seems to be a fast-moving target, so I am going to create a branch "s3_inputs" to implement this and send you the merge request later today. Should only take me a few minutes to do.

Will add a wiki page for defining inputs which you can keep separate or merge into the job_api page as you like.

@delagoya
Contributor Author

Closing this issue. I think pre-authenticated URLs are the way to go

This issue was closed.