Initial suggestion of a template server #3373
Conversation
Useful basis for decreased TTV (time-to-value), more maintainable Spark clients, and easier configuration of detached GUIs.
Do you mind adding some kind of a flow chart to show the whole process of what you're describing here?
And just to clarify, the server will hold the templates that were configured by the administrators (and might have some predefined-ready-to-roll templates that we'll offer?), and the clients of the server will ask for the templates they need and populate them when they get them back?
```conf
spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.access_key={{.user.credentials.access_key.id}}
spark.hadoop.fs.lakefs.secret_key={{.user.credentials.access_key.secret}}
```
Not sure it's a good idea to expose the secrets through templating
Note that these are just the API access credentials that appeared on the request. So I'm assuming your complaint is relevant also (or more) to `blockstore.s3.credentials` below.
If I put them here on a template, the admin can use widely-understood IAM actions to control access to the template, or to the secrets. If they're just on some endpoint, the admin needs to use a single-purpose IAM action to control access to that endpoint. How is using a consistent IAM scheme worse than exposing them on any other random API endpoint?
Thanks for clarifying about these being taken from the request.
So yes, let's discuss returning the underlying storage credentials from the configuration: I'm not referring specifically to templating - I think in general we shouldn't return them from any API endpoint.
[html/template][html/template] for safety.

The template is read by lakeFS from a specific (configured) prefix, by default `lakefs://lakefs-templates/main`. IAM authorizes users to see
I'm not sure what we gain by storing the templates in lakeFS.
This makes it harder for us to create built-in templates.
For example: there is going to be an option to create a new repository pre-configured to be connected to Spark.
So with this suggestion, we will need to pre-upload the template to an existing repository or to the new one, just for the user to get the templated Spark snippet.
@arielshaqed, WDYT?
Just the usual advantages of giving Git versioning semantics to everything. So I believe it makes management easier, at least in terms of history and lineage. For instance, attach the commit ID of the template that generated the file (add this as an available property, say). Now its lineage is clear. Or, a template suddenly seems different. Why did it change?
On the opposite side, so far each time we added direct access to the backing store we got into trouble. #2732 is still open, and is tough to fix. Every new feature we add tries to bypass lakeFS, and many of them end up getting into trouble because of this. #2491 is an attempt to bypass our known-racy implementation of direct access to (some) "repo-level settings", which are implemented by directly accessing the blockstore.
Also, if we implement for S3 we immediately get 2 more open issues to reimplement for Azure and GCS, and maybe another issue to reimplement for using a different S3 endpoint (if I'm using MinIO and want to access the templates controlled by lakeFS).
(I do understand that for some use-cases this complicates matters. Would you prefer to leave storing on lakeFS optional, starting with an implementation that takes templates from the block adaptor and adding the option to store them in lakeFS in the future? You would still run into loads of tech debt for supporting all object storage types, but at least we could move on to the other important features!)
My only concern was that taking templates from lakeFS complicates our first time-to-value tasks - namely showing the user Spark snippets when creating a repository. If @Jonathan-Rosenberg is happy with the current design, I am too.
Thanks; PTAL!
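As an aside on the `html/template` reference above: a tiny runnable Go illustration of the contextual escaping that motivates choosing it over `text/template` (the template text and values here are made-up examples, not from the design):

```go
package main

import (
	"html/template"
	"os"
)

func main() {
	// html/template escapes expanded values according to their context,
	// so a hostile parameter cannot inject markup into the rendered output.
	t := template.Must(template.New("demo").Parse("<pre>{{.}}</pre>\n"))
	_ = t.Execute(os.Stdout, `<script>alert("pwned")</script>`)
	// Prints: <pre>&lt;script&gt;alert(&#34;pwned&#34;)&lt;/script&gt;</pre>
}
```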
| template function | value source             | IAM permissions                              |
|:------------------|:-------------------------|:---------------------------------------------|
| config            | lakeFS configuration[^1] | `arn:lakefs:template:::config/path/to/field` |
I think there should be a global, fixed whitelist of configurations we allow to expose in this way.
Some values should never be exposed, for example the installation's encryption key.
That makes sense!
I can think of 2 ways to do this:

- Have a list of configuration properties that may be exported in a configuration setting.
- Specify an additional (Go...) tag on the `Config` struct that allows a configuration setting to be exposed. I think I prefer this option: it is the most static way possible, but also easy enough for developers to add more properties if needed.
In any case, I think we should still enforce IAM for that setting. Controlling everything by permissions on the path to the template is too hard to get right when you do care, and we can just add an "everything is allowed" default policy and put that on users.
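A minimal sketch of the tag option, assuming a hypothetical `exportable` struct tag and an illustrative `Config`; none of these names come from the actual lakeFS code:

```go
package config

import (
	"reflect"
	"strings"
)

// Config is an illustrative stand-in; the hypothetical `exportable` tag marks
// the few fields that the template `config` function may ever expose.
type Config struct {
	ListenAddress string `mapstructure:"listen_address" exportable:"true"`
	EncryptSecret string `mapstructure:"encrypt_secret_key"` // never exportable
}

// ExportableKeys lists the configuration keys whose fields carry the tag.
func ExportableKeys(cfg interface{}) []string {
	t := reflect.TypeOf(cfg)
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	var keys []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.Tag.Get("exportable") != "true" {
			continue
		}
		keys = append(keys, strings.Split(f.Tag.Get("mapstructure"), ",")[0])
	}
	return keys
}
```

Expansion would then check both this static list and the `arn:lakefs:template:::config/path/to/field` IAM permission before resolving a `config` call.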
Agreed
Sorry, managed not to post my 2 pending comments!
pkg/config/template.go (outdated)

```diff
@@ -31,8 +31,7 @@ type S3AuthInfo struct {
 }

 // Output struct of configuration, used to validate. If you read a key using a viper accessor
-// rather than accessing a field of this struct, that key will *not* be validated. So don't
-// do that.
+// rather than accessing a field of this struct, that key will *not* be validated. So don'// do that.
```
?
Whoops, sorry!
Reverted.
Force-pushed from 54b04aa to 3543c87.

Added a
```conf
spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem
{{with $creds := new_credentials}}
spark.hadoop.fs.lakefs.access_key={{$creds.ID}}
spark.hadoop.fs.lakefs.secret_key={{$creds.Secret}}
```
Suggested change, closing the `{{with}}` block:

```conf
spark.hadoop.fs.lakefs.secret_key={{$creds.Secret}}
{{end}}
```
Thanks!
Very interesting, thanks!
Added a few questions
## Template expansion flow

#### _:warning: This flow assumes templates stored on lakeFS. :warning:_
- Do you mean that we store the templates in a hidden repo?
- What are the main pros that you see in storing templates as objects of a certain repo as opposed to storing them on the server? Are you thinking of making the template repo easily extendable?
Templates are in a repo. I don't know why we would want to hide it.
In general I believe that everything should be versioned, so everything should live in a repo. Among the many advantages of storing in a repo, for templates we have:
- Ease of configuration. No need to overload the block adapter to read templates. In particular, note that so far every time we decided to store files on an object store outside of lakeFS, the implementation contained serious mistakes.
- Reproducibility and lineage. If templates change behind the user's back, they have no way of understanding why they got a particular expansion at a particular date.
- Adaptability. I want admins to be free to change templates or add new ones.
- Ease of development. When developing and testing improved or new templates, having a branch makes everything easier.

I would further claim that, as people who work on a versioning product, the burden of proof is with those who want not to store things inside the product, not with those who do.
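To make the flow concrete, a minimal Go sketch of expansion with templates stored on lakeFS; the `ObjectGetter` interface, the repository and branch names, and the numbered steps are illustrative assumptions, not the final API:

```go
package templater

import (
	"bytes"
	"context"
	"fmt"
	"html/template"
)

// ObjectGetter abstracts reading an object from lakeFS; a hypothetical interface.
type ObjectGetter interface {
	GetObject(ctx context.Context, repo, ref, path string) ([]byte, error)
}

// Expand reads a template from the configured prefix (by default
// lakefs://lakefs-templates/main) and expands it for the request.
func Expand(ctx context.Context, store ObjectGetter, name string, params map[string]string) (string, error) {
	// 1. IAM authorization for reading the template object would happen here.
	raw, err := store.GetObject(ctx, "lakefs-templates", "main", name)
	if err != nil {
		return "", fmt.Errorf("read template %s: %w", name, err)
	}
	// 2. Parse with html/template for safe, escaped expansion.
	tmpl, err := template.New(name).Parse(string(raw))
	if err != nil {
		return "", fmt.Errorf("parse template %s: %w", name, err)
	}
	// 3. Execute with the per-request parameters; each template function
	//    (config, object, new_credentials, ...) performs its own IAM check.
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, params); err != nil {
		return "", fmt.Errorf("expand template %s: %w", name, err)
	}
	return buf.String(), nil
}
```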
```conf
spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem
{{with $creds := new_credentials}}
spark.hadoop.fs.lakefs.access_key={{$creds.ID}}
```
Suggested change:

```conf
spark.hadoop.fs.lakefs.access_key={{$creds.Key}}
```
?
Not done: it's called the "ID" everywhere, in AWS and in our docs.
Looking good!
| template function | value source                        | IAM permissions                              |
|:------------------|:------------------------------------|:---------------------------------------------|
| config            | exportable lakeFS configuration[^1] | `arn:lakefs:template:::config/path/to/field` |
| object            | (small) object on lakeFS            | IAM read permission for that object          |
| contenttype       | none; set `Content-Type:`           | (none)                                       |
| new_credentials   | new credentials added to user       | `auth:CreateCredentials`                     |
This would potentially generate multiple credentials for each user. These credentials should be managed in some way.
True. However, given that there is no "usage" or other metadata associated with a key, there is no way to fetch one again. Let's solve it... later...
Thanks!
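A rough sketch of how the server might wire these functions into the template, each performing its own IAM check before producing a value; `AuthService`, the `template:ReadConfig` action, and the whitelist map are hypothetical names, not part of lakeFS:

```go
package templater

import (
	"fmt"
	"html/template"
)

// Credentials mirrors what templates reference as $creds.ID / $creds.Secret.
type Credentials struct {
	ID     string
	Secret string
}

// AuthService is a hypothetical stand-in for the lakeFS auth layer.
type AuthService interface {
	Authorize(user, action, resource string) error
	CreateCredentials(user string) (*Credentials, error)
}

// exportable is a hypothetical whitelist of config values templates may read.
var exportable = map[string]string{"blockstore.type": "s3"}

// Funcs builds the per-request FuncMap; a denied check aborts expansion.
func Funcs(auth AuthService, user string) template.FuncMap {
	return template.FuncMap{
		"new_credentials": func() (*Credentials, error) {
			// Guarded by auth:CreateCredentials, per the table above.
			if err := auth.Authorize(user, "auth:CreateCredentials", "*"); err != nil {
				return nil, err
			}
			return auth.CreateCredentials(user)
		},
		"config": func(path string) (string, error) {
			// Guarded by arn:lakefs:template:::config/<path>.
			arn := fmt.Sprintf("arn:lakefs:template:::config/%s", path)
			if err := auth.Authorize(user, "template:ReadConfig", arn); err != nil {
				return "", err
			}
			value, ok := exportable[path]
			if !ok {
				return "", fmt.Errorf("config %s is not exportable", path)
			}
			return value, nil
		},
	}
}
```

A template would then be parsed as `template.New(name).Funcs(Funcs(auth, user)).Parse(src)`, so a denied call fails the whole expansion instead of leaking a value.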
Go tests and linters stuck (never showed up as running), and this is just a design doc PR with no code. So pulling by the powers vested in me as an admin.