Showing 76 changed files with 902 additions and 7 deletions.
161 changes: 161 additions & 0 deletions
_sources/ancillary-topics/creating-a-SQL-view-in-HDFS.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,161 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Creating a SQL view in HDFS" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"A view in SQL stores a query as a virtual table. This is useful because selecting from a view is simpler than retyping the same query every time you need its output. For more information on SQL views please refer to [Views in SQL](https://learnsql.com/blog/sql-view/).\n", | ||
"\n", | ||
"As mentioned above, a SQL view avoids re-running the same pre-processing every time you want to use data produced by a query, and it avoids saving an entirely new table. A SQL view is simply a stored query that is applied to an underlying Hive table; **only** the metadata of the view is stored, and this happens when the statement that creates the view is run. Another advantage of a SQL view is that it refreshes in real time: each time the view is called, the statement used to create it is re-run against the underlying table, so if the underlying table updates then your view will as well.\n", | ||
"\n", | ||
"**Please note that a view is a virtual table**.\n", | ||
"\n", | ||
"## Creating a View\n", | ||
"Below is an example of using SQL views. We will be working with an existing Hive table called `rescue_clean`. On top of this table we will create a view that involves some very simple processing of the data. To create a view we first need to log in to HUE and navigate to where the Hive databases are stored (under the SQL tab on the left-hand side); the database we want is called `train_tmp`, as this is where `rescue_clean` is stored. Please ensure that your editor type is set to Hive (click on the arrow beside the blue Query button and select \"Editor -> Hive\").\n", | ||
"\n", | ||
"Now that we are in the right location, we can start to create a SQL view. To do this we use the `CREATE OR REPLACE VIEW` command in SQL:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "plaintext" | ||
} | ||
}, | ||
"source": [ | ||
"```SQL\n", | ||
"CREATE OR REPLACE VIEW rescue_view AS\n", | ||
" SELECT animal_group, ROUND(AVG(total_cost), 2) AS average_cost\n", | ||
" FROM rescue_clean\n", | ||
" GROUP BY animal_group\n", | ||
" ORDER BY average_cost DESC\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"![Details_of_the_SQL_rescue_view_as_seen_in_HDFS](../images/SQL_rescue_view_details.PNG)\n", | ||
"\n", | ||
"The screenshot above shows that `rescue_view` has been saved with the `VIRTUAL_VIEW` table type, as opposed to `MANAGED_TABLE` or `EXTERNAL_TABLE`. Another useful way to tell that you have a view is that an eye icon appears next to `rescue_view` in your filestore.\n", | ||
"\n", | ||
"Your view, or virtual table, can now be queried using `SELECT * FROM rescue_view LIMIT 5`, as seen in the screenshot below:\n", | ||
"\n", | ||
"![Output_of_SELECT_statement_applied_to_SQL_rescue_view](../images/SQL_rescue_view_output.PNG)\n", | ||
"\n", | ||
"## Using Spark and SQL views\n", | ||
"As well as using HUE to interact with your view, you can also access it from within a Spark session by reading it in the same way as any other Hive table:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "plaintext" | ||
} | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"+--------------------+------------+\n", | ||
"| animal_group|average_cost|\n", | ||
"+--------------------+------------+\n", | ||
"| Goat| 1180.0|\n", | ||
"| Fish| 780.0|\n", | ||
"| Bull| 780.0|\n", | ||
"| Horse| 747.44|\n", | ||
"|Unknown - Animal ...| 709.67|\n", | ||
"| Cow| 624.17|\n", | ||
"| Hedgehog| 520.0|\n", | ||
"| Lamb| 520.0|\n", | ||
"| Deer| 423.88|\n", | ||
"|Unknown - Wild An...| 390.04|\n", | ||
"|Unknown - Heavy L...| 362.55|\n", | ||
"| Sheep| 359.25|\n", | ||
"| Dog| 336.54|\n", | ||
"| Fox| 333.26|\n", | ||
"| Cat| 329.18|\n", | ||
"| Bird| 325.71|\n", | ||
"| Snake| 322.38|\n", | ||
"|Unknown - Domesti...| 314.81|\n", | ||
"| Pigeon| 311.5|\n", | ||
"| Hamster| 311.07|\n", | ||
"+--------------------+------------+\n", | ||
"only showing top 20 rows\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from pyspark.sql import SparkSession\n", | ||
"\n", | ||
"spark = (\n", | ||
" SparkSession.builder.appName(\"spark_views\")\n", | ||
" .config(\"spark.ui.showConsoleProgress\", \"false\")\n", | ||
" .getOrCreate()\n", | ||
")\n", | ||
"\n", | ||
"df = spark.read.table('train_tmp.rescue_view')\n", | ||
"df.show()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Dropping a SQL view\n", | ||
"\n", | ||
"You may want to drop a view from the database, for example because you no longer need it, you want to tidy up your database, or you want to replace an existing view. If you create views with plain `CREATE VIEW` rather than `CREATE OR REPLACE VIEW`, you need to drop the previously created view before creating another one with the same name.\n", | ||
"\n", | ||
"To drop a SQL view you simply write `DROP VIEW view_name`; in this example, that would be `DROP VIEW rescue_view`.\n", | ||
"\n", | ||
"You can also use the following command: `DROP VIEW IF EXISTS view_name`. This checks whether the view actually exists in your database and, if so, removes it; if the view does not exist, the statement does nothing rather than raising an error.\n", | ||
"\n", | ||
"## Further resources \n", | ||
"\n", | ||
"SQL documentation \n", | ||
"* [Views in SQL](https://learnsql.com/blog/sql-view/)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
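
The notebook's claims that a view stores only a query, reflects changes to its underlying table, and can be removed with `DROP VIEW IF EXISTS` can be tried without a Hive cluster. The following is a minimal sketch, assuming an in-memory SQLite database (via Python's standard `sqlite3` module) as a stand-in for Hive. The table and view names mirror the notebook, but the rows are made-up illustration data, and SQLite's dialect differs from HiveQL (for instance it has no `CREATE OR REPLACE VIEW`, so the sketch drops and recreates the view instead).

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the rows below are invented purely
# for illustration and are not taken from the real rescue_clean table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rescue_clean (animal_group TEXT, total_cost REAL)")
conn.executemany(
    "INSERT INTO rescue_clean VALUES (?, ?)",
    [("Cat", 300.0), ("Cat", 350.0), ("Dog", 400.0)],
)

# SQLite has no CREATE OR REPLACE VIEW, so drop-then-create plays that role.
conn.execute("DROP VIEW IF EXISTS rescue_view")
conn.execute(
    """
    CREATE VIEW rescue_view AS
        SELECT animal_group, ROUND(AVG(total_cost), 2) AS average_cost
        FROM rescue_clean
        GROUP BY animal_group
        ORDER BY average_cost DESC
    """
)

# Query the view once before the underlying table changes.
before = conn.execute("SELECT * FROM rescue_view").fetchall()

# Add a row to the underlying table only; the view itself is not touched.
conn.execute("INSERT INTO rescue_clean VALUES ('Dog', 600.0)")

# The view's query is re-run on each call, so the new row is reflected.
after = conn.execute("SELECT * FROM rescue_view").fetchall()

# Dropping the view removes only its definition; the table keeps its rows.
conn.execute("DROP VIEW IF EXISTS rescue_view")
remaining_views = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view'"
).fetchall()
table_rows = conn.execute("SELECT COUNT(*) FROM rescue_clean").fetchone()[0]

print("before insert:", before)
print("after insert: ", after)
print("views left:", remaining_views, "| table rows:", table_rows)
```

Running the second `DROP VIEW IF EXISTS` on a view that is already gone would also succeed silently, which is the reason to prefer it over a bare `DROP VIEW` in scripts that may be re-run.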