Collaborator

@Yancey0623 Yancey0623 commented Sep 19, 2019

fixed #835

Collaborator

@tonyyang-svail tonyyang-svail left a comment

I believe the readers of this document are SQLFlow users, so the document should focus on deploying SQLFlow with their existing Hive service. However, the current content reads more like "Hi SQLFlow developers, here are two examples of setting up a SQLFlow-Hive testing environment".

We should assume the reader is not familiar with SQLFlow, so every new concept needs a detailed explanation. For example, when we mention --datasource='hive://root:root@localhost:10000/', we should explain that the data source is in the format hive://user:password@address/database?param_n=arg_n, so that users can adjust it to their own use case.
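
A minimal sketch of how that format could be illustrated for readers, assuming the string is assembled from its parts in the shell. The variable names (`HIVE_USER`, `HIVE_ADDRESS`, and so on) are illustrative only; SQLFlow does not read them, and only the resulting string is passed to `--datasource`.

```bash
# Illustrative variable names; SQLFlow itself does not read these.
HIVE_USER="root"
HIVE_PASSWORD="root"
HIVE_ADDRESS="localhost:10000"   # host:port of HiveServer2
HIVE_DATABASE=""                 # left empty, as in the example above

# Format: hive://user:password@address/database
DATASOURCE="hive://${HIVE_USER}:${HIVE_PASSWORD}@${HIVE_ADDRESS}/${HIVE_DATABASE}"
echo "${DATASOURCE}"
```

A reader with a different deployment would change only the four variables and keep the format line untouched.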


This document is a tutorial on how to run SQLFlow, which connects to the hive server2.

For the most production environment, the system administrators may setup hive server with [authentication configuration](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-Authentication/SecurityConfiguration): e.g., KERBEROS, LDAP, PAM, or CUSTOM.

Collaborator

Do we support KERBEROS, LDAP, ...? If we support them all, we should provide specific examples.

Collaborator Author

@Yancey0623 Yancey0623 Sep 20, 2019

sql-machine-learning/gohive uses beltran/gohive to connect to HiveServer2 with SASLTransport. I think we need to add more tests in sql-machine-learning/gohive to make sure it works well with SQLFlow; until then, we can leave one example (PAM auth) here.

Test SQLFlow by running a query in Jupyter Notebook

``` bash
> docker run --rm --net=container:hive sqlflow/sqlflow \
```

Collaborator

Do we need to add -p 8888:8888 so that the Jupyter server in the container can be accessed by the browser on the host?

Collaborator Author

@Yancey0623 Yancey0623 Sep 20, 2019

The sqlflow Docker container shares the network stack of the hive container via --net=container:hive, and the hive container already exposes the port to the host with -p 8888:8888.
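
The arrangement described here can be sketched as two commands. This is a dry-run sketch under stated assumptions: the `run` helper only prints each command so the snippet is safe to execute without Docker, `example/hive-server` is a hypothetical placeholder for the Hive image, and the arguments after `sqlflow/sqlflow` are omitted because they are the ones given in the tutorial.

```bash
# Hypothetical placeholder image; substitute the Hive image you actually use.
HIVE_IMAGE="${HIVE_IMAGE:-example/hive-server}"

# Dry-run helper (an assumption of this sketch): print the command instead of running it.
run() { echo "+ $*"; }

# 1. The hive container owns the network namespace and publishes port 8888 to the host.
run docker run -d --name=hive -p 8888:8888 "${HIVE_IMAGE}"

# 2. The sqlflow container joins that namespace, so Jupyter listening on 8888
#    is reachable through the port the hive container already published.
run docker run --rm --net=container:hive sqlflow/sqlflow
```

Docker rejects `-p` on a container started with `--net=container:...`, which is why the port has to be published on the hive container rather than on the sqlflow one.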

Collaborator

tonyyang-svail commented Sep 19, 2019

@Yancey1989 Maybe the following structure would help.


Run SQLFlow with Hive via HiveServer2

This tutorial explains how to connect SQLFlow with Hive via HiveServer2. It has been tested on the Hive version [...].

Connect existing Hive Service

To connect to an existing Hive service, we only need to configure a data source string in the format

hive://user:password@address/dbname[?param1=value1&...&paramN=valueN]

The data source string contains the credentials and the configuration for connecting to Hive. For example, if we want to connect to the database iris at 127.0.0.1:10000 using root as both the username and the password, we can write hive://root:root@127.0.0.1:10000/iris as the data source string. You can find more configuration options at gohive.

Using the data source string, we can start an all-in-one SQLFlow container by running

docker run --rm -p 8888:8888 sqlflow/sqlflow bash -c \
"sqlflowserver --datasource='hive://root:root@localhost:10000/iris' &
SQLFLOW_SERVER=localhost:50051 jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''"

Then we can open a web browser and go to localhost:8888. There are many SQLFlow tutorials, e.g. tutorial_dnn_iris.ipynb. We can follow the tutorials and substitute the data for our own use.

Connect standalone Hive server for testing

We also pack a Hive server in a Docker image for testing. ....


@typhoonzero @weiguoz Maybe we should let MySQL and MaxCompute tutorials follow a similar structure?

@@ -0,0 +1,51 @@
# Run SQLFlow with Hive via HiveServer2

Collaborator Author

How to connect Hive with SQLFlow

@Yancey0623
Collaborator Author

Thanks @tonyyang-svail, I updated this PR following your comment.

@wangkuiyi wangkuiyi added the doc Document related request/bug label Sep 20, 2019
Collaborator

@tonyyang-svail tonyyang-svail left a comment

LGTM.

To connect an existing Hive server instance, we only need to configure a `datasource` string in the format of

``` text
hive://user:password@ip:port/dbname[?auth=<auth_mechanism>&session.<cfg_key1>=<cfg_value1>...&session<cfg_keyN>=valueN]
```

Collaborator

Missing dot: session<cfg_keyN>=valueN] => session.<cfg_keyN>=valueN]
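
To make the corrected form concrete, here is a minimal sketch that appends the optional parameters to a data source string, with every session key prefixed by `session.` (including the dot). The `NOSASL` value and the two session keys are assumed example values, not a list of what SQLFlow or gohive is confirmed to accept.

```bash
BASE="hive://root:root@127.0.0.1:10000/iris"
AUTH="NOSASL"   # assumed example value for <auth_mechanism>
# Assumed example session settings, space-separated key=value pairs.
SESSION_CFG="mapreduce.job.queuename=default hive.exec.parallel=true"

QUERY="auth=${AUTH}"
for kv in ${SESSION_CFG}; do
  # Each session setting becomes session.<cfg_key>=<cfg_value>; note the dot.
  QUERY="${QUERY}&session.${kv}"
done

DATASOURCE="${BASE}?${QUERY}"
echo "${DATASOURCE}"
```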

@Yancey0623 Yancey0623 merged commit 945c811 into sql-machine-learning:develop Sep 22, 2019
@Yancey0623 Yancey0623 deleted the hs2_tutorial branch September 22, 2019 13:48
shendiaomo pushed a commit to shendiaomo/sqlflow that referenced this pull request Oct 22, 2019
* run sqlflow with hive

* add how sqlflow connects with hive tutorial

Development

Successfully merging this pull request may close these issues.

Add a tutorial that introduces how to run SQLFLow with Hive via HiveServer2

5 participants