diff --git a/examples/quickstart/jupyter-notebooks/druid-visualization.ipynb b/examples/quickstart/jupyter-notebooks/druid-visualization.ipynb new file mode 100644 index 0000000000000..02ca96d9f537f --- /dev/null +++ b/examples/quickstart/jupyter-notebooks/druid-visualization.ipynb @@ -0,0 +1,620 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ad4e60b6", + "metadata": { + "tags": [] + }, + "source": [ + "# Tutorial: Basic visualizations using the Druid API\n", + "\n", + "\n", + " \n", + "This tutorial introduces basic visualization options you can use with the Druid API. It focuses on two Python modules: [pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n", + "\n", + "\n", + "## Table of contents\n", + "\n", + "- [Prerequisites](#Prerequisites)\n", + "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n", + "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n", + "- [Display data with a line graph](#Display-data-with-a-line-graph)\n", + "- [Next steps](#Next-steps)\n", + "\n", + "For the best experience, use JupyterLab so that you can always access the table of contents." + ] + }, + { + "cell_type": "markdown", + "id": "8d6bbbcb", + "metadata": { + "tags": [] + }, + "source": [ + "## Prerequisites\n", + "\n", + "If you haven't already, you'll need install the Requests library for Python before you start. For example:\n", + "\n", + "```bash\n", + "pip3 install requests\n", + "```\n", + "\n", + "Additionally, install the pandas and Bokeh libraries. For example:\n", + "```bash\n", + "pip3 install pandas\n", + "pip3 install bokeh\n", + "```\n", + "\n", + "Next, you'll need a Druid cluster with data. This tutorial uses the `wikipedia_api` datasource from [Learn the basics of the Druid API](api-tutorial.ipynb).\n", + "\n", + "\n", + "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook. Both the quickstart Druid cluster and Jupyter notebook are deployed at `localhost:8888` by default, so you'll \n", + "need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:\n", + "\n", + "```bash\n", + "# If you're using JupyterLab\n", + "jupyter lab --port 3001\n", + "# If you're using Jupyter notebook\n", + "jupyter notebook --port 3001 \n", + "```\n", + "\n", + "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines `druid_host` for the host for the Druid Router service.The quickstart deployment configures the to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7f08a52", + "metadata": {}, + "outputs": [], + "source": [ + "import requests, json\n", + "import pandas as pd\n", + "from bokeh.palettes import Spectral10\n", + "from bokeh.io import output_notebook, show\n", + "from bokeh.plotting import figure\n", + "\n", + "# druid_host is the hostname and port for your Druid deployment.\n", + "# In a distributed environment, use the Router service as the `druid_host`.\n", + "druid_host = \"http://localhost:8888\"\n", + "dataSourceName = \"wikipedia_api\"\n", + "print(f'Druid host: {druid_host}')" + ] + }, + { + "cell_type": "markdown", + "id": "2093ecf0-fb4b-405b-a216-094583580e0a", + "metadata": {}, + "source": [ + "In the rest of this tutorial, the `endpoint`, `http_method`, and `payload` variables are updated in code cells to call a different Druid endpoint to accomplish a task." + ] + }, + { + "cell_type": "markdown", + "id": "3b55af57-9c79-4e45-a22c-438c1b94112e", + "metadata": { + "tags": [] + }, + "source": [ + "## Display data in a DataFrame\n", + "\n", + "By default, when you query Druid using the API, Druid returns the results as JSON. The JSON results set is great for programmatic operations, but it is not easy to scan the data visually. They Python pandas module makes it simple to transform JSON query results to tabular format. In this section, you'll learn how to use pandas to display query results in a tabular format.\n" + ] + }, + { + "cell_type": "markdown", + "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f", + "metadata": { + "tags": [] + }, + "source": [ + "The following cell selects the top 10 channels by the number of additions and outputs the data as JSON." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql\"\n", + "print(f'Query endpoint: {druid_host}{endpoint}')\n", + "\n", + "http_method = \"POST\"\n", + "query = \"SELECT channel, SUM(added) AS additions FROM wikipedia_api_v2 GROUP BY channel ORDER BY additions DESC LIMIT 10\"\n", + "\n", + "payload = json.dumps({\n", + " \"query\" : query\n", + "})\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "\n", + "print(f'Query: {query}')\n", + "print('Query results:')\n", + "print(json.dumps(response.json(), indent=4))\n" + ] + }, + { + "cell_type": "markdown", + "id": "950b2cc4-9935-497d-a3f5-e89afcc85965", + "metadata": { + "tags": [] + }, + "source": [ + "The following cell takes the JSON results from the Druid API and uses the pandas [`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html) method to convert the results to a pandas object. The pandas ojbect lets us display the results in a tabular format. The `orient` parameter sets pandas to accept data as a list of records in the format: `[{column: value, ...}, {column: value, ...}]`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa", + "metadata": {}, + "outputs": [], + "source": [ + "additions_pd = pd.read_json(json.dumps(response.json()), orient='records')\n", + "\n", + "additions_pd" + ] + }, + { + "cell_type": "markdown", + "id": "8c8f6e17-aa3e-4871-94d8-e4bed74ed516", + "metadata": {}, + "source": [ + "You can also load your results into a pandas DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67e6e09f-97ed-4b7c-b007-d239010f069c", + "metadata": {}, + "outputs": [], + "source": [ + "additions_df = pd.DataFrame.from_records(additions_pd)\n", + "\n", + "additions_df" + ] + }, + { + "cell_type": "markdown", + "id": "9488cbee-63bf-414a-aff3-207adf39e115", + "metadata": {}, + "source": [ + "One benefit of using a DataFrame is that you can access all the pandas `DataFrame` object methods. For example, `DataFrame.max()`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64", + "metadata": {}, + "outputs": [], + "source": [ + "additions_df.max()" + ] + }, + { + "cell_type": "markdown", + "id": "a1720a88-cf1b-4022-981a-69441fac54fd", + "metadata": {}, + "source": [ + "Check out the [pandas docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) for more ideas about how to use pandas and DataFrames \n", + "with your Druid data." + ] + }, + { + "cell_type": "markdown", + "id": "07a16047-dba1-4f45-9fcb-6130c37eca61", + "metadata": {}, + "source": [ + "## Display data with a bar graph\n", + "\n", + "Tabular format is OK for scanning data, however you can use visualization and plotting tools that work with Jupyter notebooks to visualize data as plots or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some basic plots using Druid data.\n", + "\n", + "In this section, you create a simple bar chart showing the channels with the most additions during the time range." + ] + }, + { + "cell_type": "markdown", + "id": "8420df27-0e80-48d1-a806-43f6bf81dd16", + "metadata": {}, + "source": [ + "First, call `output_notebook()` to set Bokeh to output plots inline within the notebook. This lets you view your plot inline. You can also use Bokeh's `output_file` function to write your plot to an HTML file. If you want to experiment with `output_file` add it to the list of imports." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d824a82f-9e71-43c4-b289-78d4970c5cfc", + "metadata": {}, + "outputs": [], + "source": [ + "output_notebook()" + ] + }, + { + "cell_type": "markdown", + "id": "5859d743-a5a7-445c-a06f-89b8c30856ed", + "metadata": {}, + "source": [ + "There are several ways to use Bokeh with a DataFrame. In this case, make a list of channels to serve as the x axis of our plot. For the y axis, divide the total additions by 100000 for ease of display." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70b40bfc-106d-462c-a5a1-51d0da4c4498", + "metadata": {}, + "outputs": [], + "source": [ + "channels = additions_df.channel.to_list()\n", + "total_additions = [x / 100000 for x in additions_df.additions.to_list()]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "efc5f142-8b68-4430-a666-2ed961d732b9", + "metadata": {}, + "outputs": [], + "source": [ + "channels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f619fd3-6a01-4dc6-b8eb-81e4b4ae1a31", + "metadata": {}, + "outputs": [], + "source": [ + "total_additions" + ] + }, + { + "cell_type": "markdown", + "id": "6b430b4c-df81-419d-a81e-63bb0150031b", + "metadata": {}, + "source": [ + "Next, initialize your Bokeh plot (figure) with some basic configurations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07644ccd-1f0c-4a50-8929-82559f9bf803", + "metadata": {}, + "outputs": [], + "source": [ + " # Create a new plot with a title. Set the size in pixels\n", + "bar_plot = figure(height=500, width=750, x_range=channels, title=\"100000 x Additions\",\n", + " toolbar_location=None)" + ] + }, + { + "cell_type": "markdown", + "id": "e0a01f4c-6f35-4eb7-9263-4f6ed6c68223", + "metadata": {}, + "source": [ + "Now, configure the rederer for the vertical bars on the plot:\n", + "- Set the x axis to `channels`, the list of channels.\n", + "- Set the `top` coordinate that determines the bar height to `totals`.\n", + "- For a splash of color, set the `color` to `Spectral10`. \n", + "\n", + "Note that palettes in Bokeh are lists of colors. Bokeh expects the list length to equal the list length of the data dimensions -- in this case 10 colors.\n", + "\n", + "See the Bokeh docs for more information on [vertical bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar) and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3e40279-6cf1-4e84-820a-2465e5e9d84a", + "metadata": {}, + "outputs": [], + "source": [ + "bar_plot.vbar(x=channels, top=total_additions, width=0.5, color=Spectral10)" + ] + }, + { + "cell_type": "markdown", + "id": "04a6581b-fa11-40ec-ba39-04d08e80dd09", + "metadata": {}, + "source": [ + "Leave off the x axis grid lines for this plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca348337-abdd-4e34-a81a-7872736f14ca", + "metadata": {}, + "outputs": [], + "source": [ + "bar_plot.xgrid.grid_line_color = None" + ] + }, + { + "cell_type": "markdown", + "id": "b5364774-b8e7-46a0-82d3-f1a9f09a6cc0", + "metadata": {}, + "source": [ + "Now, configure the y axis:\n", + " - Set the minimum value to `0`.\n", + " - Set the visible range to `(0,40)`.\n", + " \n", + "Then we can display our plot using the `show` function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5f83f5c-f546-4a71-be36-30f659a9b94b", + "metadata": {}, + "outputs": [], + "source": [ + "bar_plot.y_range.start = 0\n", + "y_range=(0, 40)" + ] + }, + { + "cell_type": "markdown", + "id": "8b54c25a-07af-4ade-96d5-15ca04178fe2", + "metadata": {}, + "source": [ + "Finally, display your plot with the `show()` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44a604ec-208c-4d7a-9d25-97b308328407", + "metadata": {}, + "outputs": [], + "source": [ + "show(bar_plot)" + ] + }, + { + "cell_type": "markdown", + "id": "772a20f2-2ea9-4e67-96fa-0a1eab32bb5a", + "metadata": {}, + "source": [ + "## Display data with a line graph" + ] + }, + { + "cell_type": "markdown", + "id": "9fd43bc4-d048-4319-8dd0-2b0ca94fd33b", + "metadata": {}, + "source": [ + "In this section, you'll create a graph with a line the total additions for channels alongside the count of unique users as editors versus the number of edits flagged as a robot." + ] + }, + { + "cell_type": "markdown", + "id": "1f915e03-103c-4151-93c4-ad779741f2a5", + "metadata": {}, + "source": [ + "First, change the query to include the editors and robots and load the results into a new pandas DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "910920f5-db6a-4c3a-95fc-d25763d45de4", + "metadata": {}, + "outputs": [], + "source": [ + "query = \"\"\"\n", + " SELECT channel, SUM(added) AS additions,\n", + " COUNT (DISTINCT user) as editors,\n", + " SUM(CASE WHEN isRobot='true' THEN 1 ELSE 0 END) AS robots\n", + " FROM wikipedia_api_v2\n", + " GROUP BY channel ORDER BY additions DESC LIMIT 10\n", + " \"\"\"\n", + "\n", + "# Update the payload.\n", + "payload = json.dumps({\n", + " \"query\" : query\n", + "})\n", + "\n", + "# Submit the request.\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "\n", + "# Add the results to a pandas DataFrame.\n", + "editors_pd = pd.read_json(json.dumps(response.json()), orient='records')\n", + "editors_df = pd.DataFrame.from_records(editors_pd)\n", + "\n", + "editors_df" + ] + }, + { + "cell_type": "markdown", + "id": "5f761a0b-efe1-4e4a-bfa5-a80aca3e2cc8", + "metadata": {}, + "source": [ + "Set up values for your plots:\n", + "- Channels remains the basis for the x axis.\n", + "- This time, change the resolution for additions to 1000 so it will fit nicely on our plot with the editors and the robots values.\n", + "- Create lists of values for editors and robots." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "842b96f9-b22c-47ef-b050-fbc11b234296", + "metadata": {}, + "outputs": [], + "source": [ + "channels = editors_df.channel.to_list()\n", + "total_additions = [x / 1000 for x in editors_df.additions.to_list()]\n", + "editors = editors_df.editors.to_list()\n", + "robots = editors_df.robots.to_list()" + ] + }, + { + "cell_type": "markdown", + "id": "7ee812f8-fc6e-48b6-abf3-1f8cf87f07eb", + "metadata": {}, + "source": [ + "Next, create a plot the same as we did before, but this time we'll leave in the Bokeh tools so you can try those out at the end." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f59bb2ee-d513-492c-b021-58ce9e9e922b", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a plot.\n", + "line_plot = figure(x_range=channels, height=500, width=750, title=\"Editors vs robots\")" + ] + }, + { + "cell_type": "markdown", + "id": "50cfed5d-efcc-4429-9929-3235682cd869", + "metadata": {}, + "source": [ + "Let's change the scale a little bit." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20a4f394-898f-467c-b77e-38e95c45c380", + "metadata": {}, + "outputs": [], + "source": [ + "line_plot.y_range.start = 0\n", + "y_range=(0, 4000)" + ] + }, + { + "cell_type": "markdown", + "id": "6e4224f5-41a8-42b1-a3c9-cfc0bd96d3a5", + "metadata": {}, + "source": [ + "Now we'll just add our data lines onto the plot:\n", + "- For the line color, we only need one color per line. To keep with the Spectral pallete,set the index (0-9) for the color you want.\n", + "- Bokeh adds a legend for the lines. The legend label identifies each line in the legend.\n", + "- The `line_width` and `line_dash` properties control the line appearance.\n", + "\n", + "See the [Bokeh docs](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.line) for more information about the line figures." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a2f7688-a65b-4489-8f0f-13b5359d8d7c", + "metadata": {}, + "outputs": [], + "source": [ + "# Add renderers for the lines\n", + "addition_line = line_plot.line(channels, total_additions, line_color=Spectral10[0], legend_label=\"Additions x 1000\",line_width=3, line_dash='solid')\n", + "deletion_line = line_plot.line(channels, editors, line_color=Spectral10[3], legend_label=\"Editors\", line_width=3, line_dash='dashed')\n", + "robots_line = line_plot.line(channels, robots, line_color=Spectral10[7], legend_label=\"Robots\", line_width=3, line_dash='dotted')" + ] + }, + { + "cell_type": "markdown", + "id": "5c8e398e-b9ca-48e8-87ca-6191b3143245", + "metadata": {}, + "source": [ + "Finally, display the plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf8fec2b-e71a-4c54-81f5-b3c874d9ede0", + "metadata": {}, + "outputs": [], + "source": [ + "show(line_plot)" + ] + }, + { + "cell_type": "markdown", + "id": "963d9bdf-a9f2-455c-8c32-997f2df06827", + "metadata": {}, + "source": [ + "The graph shows some interesting data: channels with more bots editing topics tend to have fewer users editing the topcs. Those are some hard-working bots!\n", + "\n", + "Notice the toolbar to the right of the plot. You can use it to focus and navigate around the plot." + ] + }, + { + "cell_type": "markdown", + "id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e", + "metadata": { + "tags": [] + }, + "source": [ + "## Next steps\n", + "\n", + "This tutorial covers the absolute basics for visualization with Apache Druid, pandas, and Bokeh. See the following topics to learn more about the various visualizations you can build:\n", + "\n", + "- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n", + "- [pandas](https://pandas.pydata.org/)\n", + "- [Bokeh](https://bokeh.org/)\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + }, + "toc-autonumbering": false, + "toc-showcode": false, + "toc-showmarkdowntxt": false, + "vscode": { + "interpreter": { + "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}