
Initial suggestion of a template server #3373

Merged: 4 commits from design/template-engine into master, Jun 21, 2022

Conversation

@arielshaqed (Contributor):

Useful basis for decreased TTV, more maintainable Spark clients, and more
configuration of detached GUIs.

@arielshaqed arielshaqed requested a review from a team May 17, 2022 08:15
@arielshaqed arielshaqed added the exclude-changelog PR description should not be included in next release changelog label May 17, 2022
@Jonathan-Rosenberg (Contributor) left a comment:

Do you mind adding some kind of flow chart to show the whole process you're describing here?
And just to clarify: the server will hold the templates configured by the administrators (and might offer some predefined, ready-to-roll templates?), and the server's clients will ask for the templates they need and populate them when they get them back?

```conf
spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.access_key={{.user.credentials.access_key.id}}
spark.hadoop.fs.lakefs.secret_key={{.user.credentials.access_key.secret}}
```
Contributor:

Not sure it's a good idea to expose the secrets through templating

@arielshaqed (Author):

Note that these are just the API access credentials that appeared on the request. So I'm assuming your complaint is also (or even more) relevant to blockstore.s3.credentials below.

If I put them here on a template, the admin can use widely-understood IAM actions to control access to the template, or to the secrets. If they're just on some endpoint, the admin needs to use a single-purpose IAM action to control access to that endpoint. How is using a consistent IAM scheme worse than exposing them on any other random API endpoint?
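For concreteness, here is a sketch of the kind of policy an admin might attach to scope who may read (and therefore expand) a given template, using lakeFS's IAM policy format. The `lakefs-templates` repository name follows the design's default prefix; the `spark.conf.tt` object name is made up for illustration:

```json
{
    "id": "ExpandSparkTemplate",
    "statement": [
        {
            "action": ["fs:ReadObject"],
            "effect": "allow",
            "resource": "arn:lakefs:fs:::repository/lakefs-templates/object/spark.conf.tt"
        }
    ]
}
```

Granting or withholding a policy like this would then control access to the template and to any secrets it expands with one consistent mechanism, which is the argument being made here.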

Contributor:

Thanks for clarifying about these being taken from the request.

So yes, let's discuss returning the underlying storage credentials from the configuration: I'm not referring specifically to templating - I think in general we shouldn't return them from any API endpoint.

> [html/template][html/template] for safety.
>
> The template is read by lakeFS from a specific (configured) prefix, by
> default `lakefs://lakefs-templates/main`. IAM authorizes users to see
Contributor:

I'm not sure what we gain by storing the templates in lakeFS.
This makes it harder for us to create built-in templates.
For example: there is going to be an option to create a new repository pre-configured to be connected to Spark.
So with this suggestion, we will need to pre-upload the template to an existing repository or to the new one, just for the user to get the templated Spark snippet.

@arielshaqed, WDYT?

@arielshaqed (Author):

Just the usual advantages of giving Git versioning semantics to everything. So I believe it makes management easier, at least in terms of history and lineage. For instance, attach the commit ID of the template that generated the file (add this as an available property, say). Now its lineage is clear. Or, a template suddenly seems different. Why did it change?

On the opposite side, so far each time we added direct access to the backing store we got into trouble. #2732 is still open, and is tough to fix. Every new feature we add tries to bypass lakeFS, and many of them end up getting into trouble because of this. #2491 is an attempt to bypass our known-racy implementation of direct access to (some) "repo-level settings", which are implemented by directly accessing the blockstore.

Also, if we implement for S3 we immediately get 2 more open issues to reimplement for Azure and GCS, and maybe another issue to reimplement for using a different S3 endpoint (if I'm using MinIO and want to access the templates controlled by lakeFS).

(I do understand that for some use cases this complicates matters. Would you prefer to leave storing on lakeFS optional, start with an implementation that takes templates from the block adaptor... and later add the option to store them in lakeFS? You would still run into loads of tech debt supporting all object storage types, but at least we could move on to the other important features!)

Contributor:

My only concern was that taking templates from lakeFS complicates our first time-to-value tasks - namely showing the user Spark snippets when creating a repository. If @Jonathan-Rosenberg is happy with the current design, I am too.

@arielshaqed (Author):

Thanks; PTAL!


| template function | value source | IAM permissions |
|:------------------|:--------------------------|:---------------------------------------------|
| config | lakeFS configuration[^1] | `arn:lakefs:template:::config/path/to/field` |
Contributor:

I think there should be a global, fixed whitelist of configurations we allow to expose in this way.
Some values should never be exposed, for example the installation's encryption key.

@arielshaqed (Author):

That makes sense!

I can think of 2 ways to do this:

1. Have a list of configuration properties that may be exported in a configuration setting.
2. Specify an additional (Go...) tag on the Config struct that allows a configuration setting to be exposed (sketched below). I think I prefer this option: it is the most static way possible, but also easy enough for developers to add more properties if needed.

In any case, I think we should still enforce IAM for that setting. Controlling everything by permissions on the path to the template is too hard to get right when you do care, and we can just add an "everything is allowed" default policy and attach it to users.
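As a rough illustration of option 2, a minimal Go sketch. The `template:"exposed"` tag name and the fields shown here are hypothetical, not the actual lakeFS Config struct:

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Config stands in for the real lakeFS Config struct. The (assumed)
// `template:"exposed"` tag marks fields the template server may expose;
// untagged fields stay hidden.
type Config struct {
	ListenAddress string `mapstructure:"listen_address" template:"exposed"`
	EncryptKey    string `mapstructure:"encrypt_key"` // never exposed
}

// isExposed walks a dotted field path on the struct type and reports
// whether every field along the path carries the "exposed" tag.
func isExposed(cfg interface{}, path string) bool {
	t := reflect.TypeOf(cfg)
	for _, name := range strings.Split(path, ".") {
		if t.Kind() != reflect.Struct {
			return false
		}
		f, ok := t.FieldByName(name)
		if !ok || f.Tag.Get("template") != "exposed" {
			return false
		}
		t = f.Type
	}
	return true
}

func main() {
	fmt.Println(isExposed(Config{}, "ListenAddress")) // true
	fmt.Println(isExposed(Config{}, "EncryptKey"))    // false
}
```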

Contributor:

Agreed

@arielshaqed (Author) left a comment:

Sorry, managed not to post my 2 pending comments!


```diff
@@ -31,8 +31,7 @@ type S3AuthInfo struct {
 }

 // Output struct of configuration, used to validate. If you read a key using a viper accessor
-// rather than accessing a field of this struct, that key will *not* be validated. So don't
-// do that.
+// rather than accessing a field of this struct, that key will *not* be validated. So don'// do that.
```
Contributor:

?

@arielshaqed (Author):

Whoops, sorry!
Reverted.

@arielshaqed arielshaqed force-pushed the design/template-engine branch from 54b04aa to 3543c87 Compare June 12, 2022 18:50
@arielshaqed arielshaqed requested a review from johnnyaug June 12, 2022 18:50
@arielshaqed (Author):

Added a new_credentials template function, as discussed.

```conf
spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem
{{with $creds := new_credentials}}
spark.hadoop.fs.lakefs.access_key={{$creds.ID}}
spark.hadoop.fs.lakefs.secret_key={{$creds.Secret}}
```
Contributor:

Suggested change:
```diff
-spark.hadoop.fs.lakefs.secret_key={{$creds.Secret}}
+spark.hadoop.fs.lakefs.secret_key={{$creds.Secret}}
+{{end}}
```

@arielshaqed (Author):

Thanks!
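To make the mechanics concrete, here is a hedged sketch of how a `new_credentials` function could be wired into Go's [html/template] via a FuncMap and used to expand the snippet above. The `Credentials` type and the canned values are assumptions for illustration; a real server would mint credentials through the auth service (`auth:CreateCredentials`) for the requesting user:

```go
package main

import (
	"html/template"
	"os"
)

// Credentials is a hypothetical result type for new_credentials; the
// design doc only fixes the field names ID and Secret used in templates.
type Credentials struct {
	ID     string
	Secret string
}

func main() {
	funcs := template.FuncMap{
		// Canned values for the sketch; a real implementation would call
		// the auth service to create credentials for the requesting user.
		"new_credentials": func() (Credentials, error) {
			return Credentials{ID: "AKIA...", Secret: "..."}, nil
		},
	}

	const snippet = `spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem
{{with $creds := new_credentials}}
spark.hadoop.fs.lakefs.access_key={{$creds.ID}}
spark.hadoop.fs.lakefs.secret_key={{$creds.Secret}}
{{end}}`

	t := template.Must(template.New("spark.conf").Funcs(funcs).Parse(snippet))
	if err := t.Execute(os.Stdout, nil); err != nil {
		panic(err)
	}
}
```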

@talSofer (Contributor) left a comment:

Very interesting, thanks!
Added a few questions


> ## Template expansion flow
>
> #### _:warning: This flow assumes templates stored on lakeFS. :warning:_
Contributor:

- Do you mean that we store the templates in a hidden repo?
- What are the main pros that you see in storing templates as objects of a certain repo as opposed to storing them on the server? Are you thinking of making the template repo easily extendable?

@arielshaqed (Author):

Templates are in a repo. I don't know why we would want to hide it.

In general I believe that everything should be versioned, so everything should live in a repo. Among the many advantages of storing in a repo, for templates we have:

1. Ease of configuration. No need to overload the block adapter to read templates.

   In particular, note that so far every time we decided to store files on an object store outside of lakeFS, the implementation contained serious mistakes.

2. Reproducibility and lineage. If templates change behind the user's back, they have no way of understanding why they got a particular expansion at a particular date.

3. Adaptability. I want admins to be free to change templates or add new ones.

4. Ease of development. When developing and testing improved or new templates, having a branch makes everything easier.

I would further claim that, as people who work on a versioning product, the burden of proof is on those who would not store things inside the product, not on those who would.

```conf
spark.hadoop.fs.lakefs.impl=io.lakefs.LakeFSFileSystem
{{with $creds := new_credentials}}
spark.hadoop.fs.lakefs.access_key={{$creds.ID}}
```
@Jonathan-Rosenberg (Contributor), Jun 13, 2022:

Suggested change:
```diff
-spark.hadoop.fs.lakefs.access_key={{$creds.ID}}
+spark.hadoop.fs.lakefs.access_key={{$creds.Key}}
```

?

@arielshaqed (Author):

Not done: it's called the "ID" everywhere, in AWS and in our docs.

@johnnyaug (Contributor) left a comment:

Looking good!

| template function | value source | IAM permissions |
|:------------------|:--------------------------|:---------------------------------------------|
| config | exportable lakeFS configuration[^1] | `arn:lakefs:template:::config/path/to/field` |
| object | (small) object on lakeFS | IAM read permission for that object |
| contenttype | none; set `Content-Type:` | (none) |
| new_credentials | new credentials added to user | `auth:CreateCredentials` |
Contributor:

This would potentially generate multiple credentials for each user. These credentials should be managed in some way.

@arielshaqed (Author):

True. However, given that there is no "usage" or other metadata associated with a key, there is no way to fetch one again. Let's solve it... later...

@arielshaqed (Author) left a comment:

Thanks!


@arielshaqed (Author):

Go tests and linters are stuck (they never showed up as running), and this is just a design-doc PR with no code. So pulling by the powers vested in me as an admin.

@arielshaqed arielshaqed merged commit 09efbaf into master Jun 21, 2022