Wizard MVP proposal #3499
Conversation
design/open/ttv-wizard-mvp.md
Outdated
2. Imported data from <S3 location>.
3. Generated the following configurations:
<Spark configurations with [“<secret1>”, “<secret2>”...] hidden>
4. Instructions to configure Hive Metastore with lakeFS.
And, which `lakectl metastore` commands to run when creating/deleting/merging a branch/tag.
I thought that we wanted to have the following:
- Have a step in the Wizard that generates the Hive configurations snippet
- Include instructions for which lakectl commands to use when creating/deleting/merging a branch/tag (as @johnnyaug mentioned).
Do you think that the extra step is redundant?
Point one will generate configurations that are exactly the same as the Spark configurations generated...
Boom
Thanks!
Can you please add a section that talks about where the templates are stored?
(forgot to hit submit yesterday...)
design/open/ttv-wizard-mvp.md
Outdated
2. Generated the following configurations:
<Spark configurations with [“<secret1>”, “<secret2>”...] hidden>
3. Instructions to configure Hive Metastore with lakeFS.
4. Generated this README file and committed it.
I would remove step number 4
NEAT!
How do we render the README? I assume this requires an additional feature.
The wizard UI component is responsible for the user’s Spark onboarding process. The process is as follows:
1. Create a repository (named as the user wishes) and a ‘main’ branch in it.
2. Import the user’s data to ‘main’ and display a progress bar (which will show a link to required permissions). Only available on cloud-deployed lakeFS.
Why only cloud-deployed? I suspect enforcing this will only add work.
design/open/ttv-wizard-mvp.md
Outdated
The [templating service](https://github.com/treeverse/lakeFS/pull/3373) is responsible for fetching, authenticating, and expanding the required templates and returning them to the client.
**Process**:
1. Get the S3 templates (locations should be specified in the incoming request). The files must be valid [`html/template`](https://pkg.go.dev/html/template) or [`text/template`](https://pkg.go.dev/text/template) parsable template texts. |
Not sure incoming request should entirely specify the location. Would be happier if a set prefix were configured, and the incoming request could only select below that prefix. Otherwise admins will have a very tough time ensuring the access is safe.
The proposed template service never allows access by arbitrary path precisely for this reason. It protects access to templates on lakefs in some way; if you want to do the same but on S3, we will need to design similar protections for files from S3.
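A minimal sketch of the kind of prefix check suggested above (function and variable names are hypothetical, not from the proposal): the request may only select a path below an admin-configured prefix, and anything that escapes it is rejected:

```go
package templating

import (
	"errors"
	"path"
	"strings"
)

var ErrPathEscapesPrefix = errors.New("template path escapes configured prefix")

// resolveTemplatePath joins a client-supplied relative location onto the
// admin-configured prefix; path.Join cleans "." and ".." segments, so any
// attempt to escape the prefix is caught by the comparison below.
func resolveTemplatePath(configuredPrefix, requested string) (string, error) {
	joined := path.Join(configuredPrefix, requested)
	if joined != configuredPrefix && !strings.HasPrefix(joined, configuredPrefix+"/") {
		return "", ErrPathEscapesPrefix
	}
	return joined, nil
}
```

For example, `resolveTemplatePath("wizard-templates", "../secrets.tt")` returns the error, while `resolveTemplatePath("wizard-templates", "spark/emr.conf.tt")` resolves to a path under the configured prefix.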
2. Templating service - status 200
```json
{
  "class": "templating_service",
  "name": "successful_call",
  "value": "<service name>"
}
```
3. Templating service - status 500 - no access to provided template location
```json
{
  "class": "templating_service",
  "name": "no_access",
  "value": "<service name>"
}
```
Don't we send status on our monitoring outputs as field values, rather than as metric names?
4. Templating service - status 5xx - general
```json
{
  "class": "templating_service",
  "name": "5xx",
  "value": "<service name>"
}
```
5. Templating service - status 4xx - general
```json
{
  "class": "templating_service",
  "name": "4xx",
  "value": "<service name>"
}
```
Don't the existing metrics on web endpoints already cover this?
Possibly. I just listed what we want, not what we need to implement.
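For illustration, a sketch of the alternative shape the earlier comment suggests: one event per call, with the HTTP status reported as a field value rather than encoded into the event name. All field names here are hypothetical:

```go
// templatingEvent is a single event shape for all templating-service calls;
// the status is a field value, so dashboards can aggregate by status
// without defining a new event name per outcome.
type templatingEvent struct {
	Class      string `json:"class"`       // always "templating_service"
	Name       string `json:"name"`        // e.g. "call"
	Value      string `json:"value"`       // <service name>
	StatusCode int    `json:"status_code"` // 200, 404, 500, ...
}
```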
3. The wizard creates a repo named ‘spark-repo’ and sets a default ‘main’ branch.
4. The wizard asks the user if they want to import existing data to lakeFS. The user skips this step using the skip button.
5. The wizard asks the user: “How are you using Spark?”, and offers three alternatives: Databricks, AWS EMR, and spark-submit. The user chooses ‘Databricks’. It also asks the user for their lakeFS endpoint (and will show a default placeholder pointing to the current URL). The user types the endpoint.
6. The wizard sends a GET request to the templating service with a query string of the format:
Optionally, we might use the templating service as the source for the "how are you using this" data. It's some JSON that needs to be generated somewhere, and may have some fields to specify. Looks like another use-case for templates.
Thanks!
Still open issues:
- Rendering the README
- Monitoring: can we use existing metrics rather than define new ones?
design/open/ttv-wizard-mvp.md
Outdated
The [templating service](https://github.com/treeverse/lakeFS/pull/3373) is responsible for fetching, authenticating, and expanding the required templates and returning them to the client.
**Process**:
1. Get the template (the location should be specified in the incoming request). The file must be a valid [`html/template`](https://pkg.go.dev/html/template) or [`text/template`](https://pkg.go.dev/text/template) parsable template text. |
How do we distinguish the two? I suggest the extension before `.tt`: say `PATH.something.tt` generates `PATH.something`, and `PATH.html.tt` generates `PATH.html`.
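A sketch of that suffix rule in Go (assuming the service parses the template itself; the helper and interface names are hypothetical): the extension preceding `.tt` selects `html/template` with its contextual escaping for `.html` outputs, and `text/template` for everything else:

```go
package templating

import (
	htmltemplate "html/template"
	"io"
	"strings"
	texttemplate "text/template"
)

// executor is satisfied by both *text/template.Template and
// *html/template.Template, which share the Execute method.
type executor interface {
	Execute(wr io.Writer, data any) error
}

// parseBySuffix picks the engine from the extension preceding ".tt":
// "config.html.tt" gets html/template's contextual escaping, anything
// else is treated as plain text.
func parseBySuffix(name, body string) (executor, error) {
	base := strings.TrimSuffix(name, ".tt")
	if strings.HasSuffix(base, ".html") {
		return htmltemplate.New(name).Parse(body)
	}
	return texttemplate.New(name).Parse(body)
}
```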
Cool
4. The wizard asks the user if they want to import existing data to lakeFS. The user specifies the location of the bucket (after validating that the lakeFS role has the right permissions and that the bucket has the correct policy), and clicks ‘OK’.
* An object counter will show the progress of the import process and will signal once it’s over.
I would really like this to link to the import design. Or, if we need to change import, let's call it out and remain with one good implementation of the same feature.
Here, if a counter makes sense during the wizard, it also makes sense during a plain import.
We don't need to change the import. We don't use it as-is, we just leverage its functionality, not the exact behavior.
It's not the same feature.
There is a counter in the plain import.
The UI-Import is an assembly of 5-6 different API calls. To get the behaviour you want here you'll need to replace 1-2 API calls, but it's all pretty straightforward.
* SparkEMR.conf.tt
* SparkDatabricks.conf.tt
* SparkXML.conf.tt
* README.md.tt |
How is the generated `README.md` rendered?
We agreed that it will not be rendered at first.
### BI Metrics

Sent directly from the GUI Wizard
I think that all of these should identify which wizard it was. Initially we expect to have only one, but later we expect to have more!
Similarly, we might report which Spark configs were generated.
They identify it as `spark_wizard`, did you mean something else?
@arielshaqed, regarding monitoring: if we can leverage existing metrics, we will do so; if not, we'll make amendments to get the knowledge we need. Either way, this is an implementation detail.
2. [Figma Design](https://www.figma.com/file/haD9y599LzW6LvsYBI2xWU/Spark-use-case?node-id=31%3A200)
4. Generates a README file with all actions performed, and will commit it as a first commit (after the import) to the ‘main’ branch of the created repository.

### Templating Service
Is this service (a) bundled with the lakeFS server binary or (b) completely decoupled but shipped with the lakeFS binary or (c) a dedicated centralized service?
a
5. The wizard asks the user for their lakeFS endpoint (and will show a default placeholder pointing to the current URL).
6. The wizard will send a GET request to the templating service with a query string of the format:
```
?lakefs_url=https://my-lakefs.io&template_location=databricksConfig.props.tt
```
Why can't the service infer the lakeFS URL from the request itself? The UI knows the address of the server, right?
For trickier installations, where the address depends on the caller flow (i.e. with/without VPC), supplying the address gets harder for the user.
The intention was that the URL will be set by default by the Wizard GUI (inferred from the URL it's working at), and we'll let the users change it if they want to.
> (and will show a default placeholder pointing to the current URL).
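A sketch of that fallback (hypothetical handler code, not from the proposal): the service could honor an explicit `lakefs_url` query parameter when present, and otherwise infer the address from the incoming request:

```go
package templating

import "net/http"

// lakefsURL prefers the wizard-supplied override, which matters for
// installations where the caller-visible address differs (e.g. behind a
// VPC or proxy), and falls back to the address the request arrived on.
func lakefsURL(r *http.Request) string {
	if u := r.URL.Query().Get("lakefs_url"); u != "" {
		return u
	}
	scheme := "https"
	if r.TLS == nil {
		scheme = "http"
	}
	return scheme + "://" + r.Host
}
```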
```properties
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access_key=ACCESSKEYDONTTELL
spark.hadoop.fs.s3a.secret_key=SECRETKEYDONTTELL
```
This kind of breaks our promise when we create credentials: “This is the only time you can see your credentials - make sure to copy them…”. I'm not saying we shouldn't do it, only that we should probably rephrase that sentence.
Good point.
Will add that
```
?lakefs_url=https://my-lakefs.io&template_location=README.md.tt
```
It will do so to generate a README.md file.
10. The returned README file will describe the steps taken, the configurations generated (without secrets), and some commands explaining how to connect Hive Metastore to lakeFS:
How does it know which steps were taken? If I run the same wizard 100 times, will it show 600 steps?
The Wizard's intention is to create a repo ready to use. Each run will create a different repo (we won't allow overwriting repos), so the wizard runs once per repo, and the README specifies only the steps taken while creating that repo.
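To make that concrete, a sketch of what expanding `README.md.tt` could look like (the data shape and step list are hypothetical; the proposal only fixes the `.tt` naming): the wizard records the steps it actually performed for this one repository and renders them into the README:

```go
package wizard

import (
	"io"
	"text/template"
)

// readmeData records the steps the wizard performed for this one
// repository; a rerun creates a new repo with its own short list,
// so the README never accumulates steps across runs.
type readmeData struct {
	RepoName string
	Steps    []string // e.g. "Created repository", "Imported data from s3://..."
}

var readmeTmpl = template.Must(template.New("README.md.tt").Parse(
	`# {{.RepoName}}

This repository was initialized by the lakeFS wizard, which:
{{range .Steps}}* {{.}}
{{end}}`))

func writeReadme(w io.Writer, d readmeData) error {
	return readmeTmpl.Execute(w, d)
}
```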
This is a proposal for the time-to-value wizard MVP.
For further info please refer to #3411, #3373, and the Figma design.