Wizard MVP proposal #3499

Merged: 4 commits merged into master from proposal/wizard on Jun 21, 2022
Conversation

Jonathan-Rosenberg
Contributor

This is a proposal for the time-to-value wizard MVP.
For further info please refer to #3411, #3373, and the Figma design.

@Jonathan-Rosenberg added the exclude-changelog and team/ecosystem labels on Jun 13, 2022
@Jonathan-Rosenberg marked this pull request as ready for review on June 14, 2022 08:11
2. Imported data from <S3 location>.
3. Generated the following configurations:
<Spark configurations with [“<secret1>”, “<secret2>”...] hidden>
4. Instructions to configure Hive Metastore with lakeFS.
Contributor

And, which lakectl metastore commands to run when creating/deleting/merging a branch/tag.

Contributor

I thought that we wanted to have the following:

  1. Have a step in the Wizard that generates the Hive configurations snippet
  2. Include instructions for which lakectl commands to use when creating/deleting/merging a branch/tag (as @johnnyaug mentioned).

Do you think the extra step is unnecessary?

Contributor Author

Point one will generate configurations that are exactly the same as the Spark configurations generated...

@johnnyaug (Contributor) left a comment

Boom

@talSofer (Contributor) left a comment

Thanks!
Can you please add a section that talks about where the templates are stored?

(forgot to hit submit yesterday...)


2. Generated the following configurations:
<Spark configurations with [“<secret1>”, “<secret2>”...] hidden>
3. Instructions to configure Hive Metastore with lakeFS.
4. Generated this README file and committed it.
Contributor

I would remove step number 4

@arielshaqed (Contributor) left a comment

NEAT!

How do we render the README? I assume this requires an additional feature.


The wizard UI component is responsible for the user’s Spark onboarding process. The process is as follows:
1. Create a repository (named as the user wishes) and a ‘main’ branch in it.
2. Import the user’s data to ‘main’ and display a progress bar (which will show a link to required permissions). Only available on cloud-deployed lakeFS.
Contributor

Why only cloud-deployed? I suspect it will only add work to enforce this


The [templating service](https://github.com/treeverse/lakeFS/pull/3373) is responsible for fetching, authenticating, and expanding the required templates and returning them to the client.
**Process**:
1. Get the S3 templates (locations should be specified in the incoming request). The files must be valid [`html/template`](https://pkg.go.dev/html/template) or [`text/template`](https://pkg.go.dev/text/template) parsable template texts.
Contributor

Not sure incoming request should entirely specify the location. Would be happier if a set prefix were configured, and the incoming request could only select below that prefix. Otherwise admins will have a very tough time ensuring the access is safe.

The proposed template service never allows access by arbitrary path precisely for this reason. It protects access to templates on lakefs in some way; if you want to do the same but on S3, we will need to design similar protections for files from S3.
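To make the prefix restriction concrete, here is a minimal sketch of a fetch-and-expand flow under that constraint, using Go's text/template (the function, the fetch callback, and the configured prefix are illustrative assumptions, not part of the proposal):

```go
package templating

import (
	"fmt"
	"io"
	"path"
	"strings"
	"text/template"
)

// configuredPrefix is an illustrative, admin-configured root under which all
// templates must live; incoming requests can only select paths below it.
const configuredPrefix = "templates/"

// Expand fetches the template at loc (relative to the configured prefix),
// parses it as a text/template, and expands it with the given parameters.
// fetch stands in for whatever object-store or lakeFS read the service uses.
func Expand(fetch func(key string) (io.ReadCloser, error), loc string, params map[string]string, w io.Writer) error {
	clean := path.Clean(loc)
	if path.IsAbs(clean) || clean == ".." || strings.HasPrefix(clean, "../") {
		return fmt.Errorf("template location %q escapes the configured prefix", loc)
	}
	rc, err := fetch(configuredPrefix + clean)
	if err != nil {
		return fmt.Errorf("fetch template: %w", err)
	}
	defer rc.Close()
	raw, err := io.ReadAll(rc)
	if err != nil {
		return fmt.Errorf("read template: %w", err)
	}
	tmpl, err := template.New(clean).Parse(string(raw))
	if err != nil {
		return fmt.Errorf("parse template: %w", err)
	}
	return tmpl.Execute(w, params)
}
```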

Comment on lines +240 to +254
2. Templating service- status 200
```json
{
"class": "templating_service",
"name": "successful_call",
"value": "<service name>"
}
```
3. Templating service - status 500 - no access to provided template location
```json
{
"class": "templating_service",
"name": "no_access",
"value": "<service name>"
}
```

Contributor

Don't we send status on our monitoring outputs as field values, rather than as metric names?
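If the outcome should indeed be carried as a field/label value rather than as part of the metric name, a hedged sketch of what that could look like with the Prometheus Go client (metric and label names are made up for illustration; the events above are BI JSON and may not map to Prometheus at all):

```go
package templating

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// templatingCalls counts templating-service calls; the outcome travels as a
// label value instead of being baked into the metric name.
var templatingCalls = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "templating_service_calls_total",
	Help: "Calls to the templating service, labeled by outcome.",
}, []string{"service", "status"})

// reportCall records one call; status would be e.g. "ok", "no_access",
// "4xx" or "5xx".
func reportCall(service, status string) {
	templatingCalls.WithLabelValues(service, status).Inc()
}
```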

Comment on lines +256 to +271
4. Templating service - status 5xx - general
```json
{
"class": "templating_service",
"name": "5xx",
"value": "<service name>"
}
```
5. Templating service - status 4xx - general
```json
{
"class": "templating_service",
"name": "4xx",
"value": "<service name>"
}
```
Contributor

Don't the existing metrics on web endpoints already cover this?

Contributor Author

Possibly. I just listed what we want, not what we need to implement.

3. The wizard creates a repo named ‘spark-repo’ and sets a default ‘main’ branch.
4. The wizard asks the users if they want to import existing data to lakeFS. The user skips this step using the skip button.
5. The wizard asks the user: “How are you using Spark”, and offers three alternatives: Databricks, AWS EMR, and spark-submit. The user chooses ‘Databricks’. It also asks the user for their lakeFS endpoint (and will show a default placeholder pointing to the current URL). The user types the endpoint.
6. The wizard sends a GET request to the templating service with a query string of the format:
Contributor

Optionally, we might use the templating service as the source for the "how are you using this" data. It's some JSON that needs to be generated somewhere, and may have some fields to specify. Looks like another use case for templates.

@arielshaqed (Contributor) left a comment

Thanks!

Still open issues:

  • Rendering the README
  • Monitoring: can we use existing metrics rather than define new ones?


The [templating service](https://github.com/treeverse/lakeFS/pull/3373) is responsible for fetching, authenticating, and expanding the required templates and returning them to the client.
**Process**:
1. Get the template (the location should be specified in the incoming request). The file must be a valid [`html/template`](https://pkg.go.dev/html/template) or [`text/template`](https://pkg.go.dev/text/template) parsable template text.
Contributor

How do we distinguish the two? I suggest the extension before .tt, say PATH.something.tt generates PATH.something and PATH.html.tt generates PATH.html.

Contributor Author

Cool
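A minimal sketch of that convention, assuming a Go implementation: strip the trailing .tt, then choose html/template when the remaining extension is .html and text/template otherwise (all names illustrative):

```go
package templating

import (
	htmltemplate "html/template"
	"io"
	"strings"
	texttemplate "text/template"
)

// executor is the small common surface of html/template and text/template
// needed here; both *Template types satisfy it.
type executor interface {
	Execute(w io.Writer, data any) error
}

// parseByExtension strips the ".tt" suffix and chooses the engine from the
// remaining extension: PATH.html.tt -> html/template (with escaping),
// anything else -> text/template.
func parseByExtension(name, raw string) (string, executor, error) {
	target := strings.TrimSuffix(name, ".tt")
	if strings.HasSuffix(target, ".html") {
		t, err := htmltemplate.New(target).Parse(raw)
		return target, t, err
	}
	t, err := texttemplate.New(target).Parse(raw)
	return target, t, err
}
```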

Comment on lines +123 to +124
4. The wizard asks the users if they want to import existing data to lakeFS. The user specifies the location of the bucket (after they validated that the lakeFS role has the right permissions and that the bucket has the correct policy), and clicks ‘OK’.
* An object counter will show the progress of the import process and will signal once it’s over.
Contributor

I would really like this to link to the import design. Or, if we need to change import, let's call it out and remain with one good implementation of the same feature.

Here, if a counter makes sense during the wizard, it also makes sense during a plain import.

Contributor Author

We don't need to change the import. We don't use it as-is; we just leverage its functionality, not its exact behavior.
It's not the same feature.
There is a counter in the plain import.

@itaiad200 (Contributor) commented on Jun 20, 2022

The UI-Import is an assembly of 5-6 different API calls. To get the behaviour you want here, you'll need to replace 1-2 of those calls, but it's all pretty straightforward.

* SparkEMR.conf.tt
* SparkDatabricks.conf.tt
* SparkXML.conf.tt
* README.md.tt
Contributor

How is the generated README.md rendered?

Contributor Author

We agreed that it will not be rendered at first.


### BI Metrics

Sent directly from the GUI Wizard
Contributor

I think that all of these should identify which wizard it was. Initially we expect to have only one, but later we expect to have more!

Similarly, we might report which Spark configs were generated.

Contributor Author

They identify it as spark_wizard; did you mean something else?

@Jonathan-Rosenberg (Contributor Author)

@arielshaqed, regarding monitoring: if we can leverage existing metrics, we will do that; if not, we'll make amendments to get the knowledge we need. Either way, this is an implementation detail.

2. [Figma Design](https://www.figma.com/file/haD9y599LzW6LvsYBI2xWU/Spark-use-case?node-id=31%3A200)
4. Generates a README file with all actions performed, and will commit it as a first commit (after the import) to the ‘main’ branch of the created repository.

### Templating Service
Contributor

Is this service (a) bundled with the lakeFS server binary, (b) completely decoupled but shipped with the lakeFS binary, or (c) a dedicated centralized service?

Contributor Author

a

Comment on lines +125 to +129
5. The wizard asks the user for their lakeFS endpoint (and will show a default placeholder pointing to the current URL).
6. The wizard will send a GET request to the templating service with a query string of the format:
```
?lakefs_url=https://my-lakefs.io&template_location=databricksConfig.props.tt
```
Contributor

Why can't the service infer the lakeFS URL from the request itself? The UI knows the address of the server, right?
For trickier installations, where the address depends on the caller's flow (e.g. with/without a VPC), supplying the address becomes harder for the user.

Contributor Author

The intention was that the URL would be set by default by the wizard GUI (inferred from the URL it is running at), and we'll let users change it if they want to.
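A hedged sketch of that default-plus-override behaviour on the service side, using the lakefs_url parameter from the query string above (the fallback to the request's own host is an assumption, not part of the proposal):

```go
package templating

import "net/http"

// lakefsURLForRequest returns the lakefs_url query parameter when the wizard
// GUI (or the user, after editing the pre-filled value) supplied one, and
// otherwise falls back to the address this request was served on.
func lakefsURLForRequest(r *http.Request) string {
	if u := r.URL.Query().Get("lakefs_url"); u != "" {
		return u
	}
	scheme := "http"
	if r.TLS != nil {
		scheme = "https"
	}
	return scheme + "://" + r.Host
}
```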

(and will show a default placeholder pointing to the current URL).

```properties
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access_key=ACCESSKEYDONTTELL
spark.hadoop.fs.s3a.secret_key=SECRETKEYDONTTELL
```

Contributor

This kind of breaks our promise when we create credentials: "This is the only time you can see your credentials - make sure to copy them...". I'm not saying we shouldn't do it, only that we should probably rephrase that sentence.

Contributor Author

Good point.
Will add that

```
?lakefs_url=https://my-lakefs.io&template_location=README.md.tt
```
It will do so to generate a README.md file.
10. The returned README file will describe the steps taken, the configurations generated (with secrets hidden), and some commands explaining how to connect Hive Metastore to lakeFS:
Contributor

How does it know which steps were taken? If I run the same wizard 100 times, will it show 600 steps?

Contributor Author

The wizard's intention is to create a repo that is ready to use. Each run will create a different repo (we won't allow overriding repos), so the wizard runs only once per repo, and the README specifies the steps taken while creating that repo.
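For illustration, a hypothetical README.md.tt in Go template syntax showing how a single run's steps could be captured (the field names are invented, not part of the proposal):

```
# {{ .RepoName }}

This repository was created by the Spark wizard, which:

1. Created the repository with a default 'main' branch.
{{- if .ImportLocation }}
2. Imported data from {{ .ImportLocation }}.
{{- end }}
3. Generated the following configurations (secrets hidden):

   {{ .SparkConfigRedacted }}

4. Provided instructions to configure Hive Metastore with lakeFS.
```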

@Jonathan-Rosenberg merged commit 015241e into master on Jun 21, 2022
@Jonathan-Rosenberg deleted the proposal/wizard branch on June 21, 2022 11:06