From 634838ed6562a06114dddebbf7e4c4afb5414708 Mon Sep 17 00:00:00 2001 From: Peter Marshall Date: Fri, 23 Feb 2024 11:29:30 +0000 Subject: [PATCH 1/4] Create 03-druid-metrics.ipynb First draft copy forward --- .../05-operations/03-druid-metrics.ipynb | 531 ++++++++++++++++++ 1 file changed, 531 insertions(+) create mode 100644 notebooks/05-operations/03-druid-metrics.ipynb diff --git a/notebooks/05-operations/03-druid-metrics.ipynb b/notebooks/05-operations/03-druid-metrics.ipynb new file mode 100644 index 00000000..75d8b039 --- /dev/null +++ b/notebooks/05-operations/03-druid-metrics.ipynb @@ -0,0 +1,531 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c9fc0614-4eec-4c39-ad22-ea10afa00e9d", + "metadata": {}, + "source": [ + "# Apache Druid metrics\n", + "\n", + "Metrics give you insight into why your Druid instance is performing in the way that it is.\n", + "\n", + "In this notebook you will take a tour of the out-of-the-box configurations for metrics in Apache Druid, and use some simple terminal commands to inspect them." + ] + }, + { + "cell_type": "markdown", + "id": "cd9a55db-61ff-4fa4-9655-2fef292ed2aa", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "This tutorial works with Druid 29.0.0 or later.\n", + "\n", + "Launch this tutorial and all prerequisites using the `jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).\n", + "\n", + "__DO NOT__ use the `jupyter-druid` profile with this tutorial as it will conflict with your locally running copy." + ] + }, + { + "cell_type": "markdown", + "id": "151ede96-348f-423c-adc7-0e23c6910b9e", + "metadata": {}, + "source": [ + "## Initialization\n", + "\n", + "To use this notebook, you will need to run Druid locally.\n", + "\n", + "You will also make extensive use of the terminal, which you can place alongside this notebook or on another screen.\n", + "\n", + "### Install tools\n", + "\n", + "Open a local terminal window.\n", + "\n", + "If you have not already install `wget` or `multitail`, run the following commands to install these tools using `brew`.\n", + "\n", + "```bash\n", + "brew install multitail ; brew install wget\n", + "```\n", + "\n", + "Run the following command to pull the default configuration for `multitail` to your home folder. Do not run this command if you are already running `multitail` as it will overwrite your own configuration.\n", + "\n", + "```bash\n", + "curl https://raw.githubusercontent.com/halturin/multitail/master/multitail.conf > ~/.multitailrc\n", + "```\n", + "\n", + "### Install Apache Druid\n", + "\n", + "Run the following to create a dedicated folder for learn-druid in your home directory:\n", + "\n", + "```bash\n", + "cd ~ ; mkdir learn-druid-local\n", + "cd learn-druid-local\n", + "```\n", + "\n", + "Now pull and decompress a compatible version of Druid.\n", + "\n", + "```bash\n", + "wget https://dlcdn.apache.org/druid/29.0.0/apache-druid-29.0.0-bin.tar.gz\n", + "tar -xzf apache-druid-29.0.0-bin.tar.gz\n", + "```\n", + "\n", + "Use the following commands to rename the folder.\n", + "\n", + "```bash\n", + "mv apache-druid-29.0.0 apache-druid\n", + "cd apache-druid\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "1d07feb7-92dd-42a9-ad6c-c89d5fdad8d5", + "metadata": {}, + "source": [ + "# Review the log file configuration\n", + "\n", + "By default, Apache Druid emits JVM metrics from all process.\n", + "\n", + "Metrics configuration is set in the `common.runtime.properties` file.\n", + "\n", + "Remembering that Druid has multiple configuration file locations out-of-the-box, run this command to view the `auto` configuration file for for metrics monitors:\n", + "\n", + "```bash\n", + "grep druid.monitoring ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", + "```\n", + "\n", + "A default [emission period](https://druid.apache.org/docs/latest/configuration/#enabling-metrics) of 1 minute is set in `druid.monitoring.emissionPeriod`. You may change this setting to a different ISO 8601 period to have metrics emitted at a different rate.\n", + "\n", + "The default configuration for Druid is to enable the `JvmMonitor` monitor. Later in this notebook you will enable [other monitors](https://druid.apache.org/docs/latest/configuration/#metrics-monitors). Note that some monitors are designed to work on specific processes, and that enabling monitors on unsupported processes will cause that process to fail during startup.\n", + "\n", + "Run the next command to view metrics emitter configurations:\n", + "\n", + "```bash\n", + "grep druid.emitter ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", + "```\n", + "\n", + "Notice that, by default, the `druid.emitter` is configured to `noop`, meaning that [no metrics are emitted](https://druid.apache.org/docs/latest/configuration/#metrics-emitters)." + ] + }, + { + "cell_type": "markdown", + "id": "ca62693b-50f4-4831-a092-dbc8e06d8191", + "metadata": {}, + "source": [ + "## Emitters\n", + "\n", + "In this section you will amend the emitter configuration in order to take a look at the JSON objects that contain the data and description for each metric.\n", + "\n", + "### Change the emitter module to Logging\n", + "\n", + "Enable the emission of metrics from your instance to the [log files](https://druid.apache.org/docs/latest/configuration/#logging-emitter-module) by running the following command.\n", + "\n", + "```bash\n", + "```\n", + "\n", + "When Druid starts, additional log entries will be created containing the JSON data for each metric.\n", + "\n", + "### Change the metrics emission rate\n", + "\n", + "\n", + "\n", + "### Start a Druid instance\n", + "\n", + "Start Druid with the following command:\n", + "\n", + "```bash\n", + "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "bec19737-2773-4c66-a05e-d0b1a707c4c8", + "metadata": {}, + "source": [ + "## Monitors\n", + "\n", + "\n", + "### Look at the standard log files\n", + "\n", + "Since Druid is a distributed system, we will find log files for each Druid process. In addition, Druid also captures the output written to the standard output. Use the following command to take a look at what is here:\n", + "\n", + "```bash\n", + "cd ~/learn-druid-local/apache-druid/log ; ls\n", + "```\n", + "\n", + "You can see how this results in two sets of files being created:\n", + "\n", + "* `process name.stdout.log` file - which is the information written by the processes to stdout (i.e., the terminal)\n", + "* `process name.log` - file containing various status, error, warning and debug messages" + ] + }, + { + "cell_type": "markdown", + "id": "5258912e-bf23-4acd-9267-a7f4ea3bc6f4", + "metadata": {}, + "source": [ + "### Log filenames\n", + "\n", + "The \"FileAppender\" `RollingRandomAccessFile` appender has both `fileName` and `filePattern` properties. The `fileName` property of the \"FileAppender\" sets the name of the log file being written to at the moment, while `filePattern` is applied when the log rolls over.\n", + "\n", + "Run the following command to change the default filename of the \"FileAppender\" in the log4j configuration file so that all future detailed process logs will have a name suffixed with the hostname.\n", + "\n", + "```bash\n", + "sed -i '' 's/{sys:druid.node.type}.log/{sys:druid.node.type}-${hostName}.log/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "List the contents of the log folder again to see the changes:\n", + "\n", + "```bash\n", + "ls\n", + "```\n", + "\n", + "Revert the change with the following command:\n", + "\n", + "```bash\n", + "sed -i '' 's/{sys:druid.node.type}-${hostName}.log/{sys:druid.node.type}.log/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Since \"FileAppender\" is a [RollingRandomAccessFileAppender](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender) you can adjust the `TimeBasedTriggeringPolicy` to change when `fileName` log files are rolled over to `filePattern` log files.\n", + "\n", + "Run the following in your terminal to adjust the `filePattern` to include the hours and minutes.\n", + "\n", + "```bash\n", + "sed -i '' 's/{yyyyMMdd}/{yyyyMMdd-HH:mm}/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Run this command to see the new file pattern:\n", + "\n", + "```bash\n", + "grep filePattern ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Since the `TimeBasedTriggeringPolicy` is set to 1 by default, a change in the least granular element of the `filePattern` will trigger a rollover. In this case, you will now see a new file being created every minute.\n", + "\n", + "Run the following command a few times - you will begin to see files being created every minute.\n", + "\n", + "```bash\n", + "ls -l\n", + "```\n", + "\n", + "Use other Log4j policies in the usual way. For example, you can adjust the `filePattern` to something like `%d{yyyyMMdd-HH:mm}-%i.log` and then use a [`SizeBasedTriggeringPolicy`](https://logging.apache.org/log4j/2.x/manual/appenders.html#sizebased-triggering-policy) instead of a `TimeBasedTriggeringPolicy` to have log files emitted when a log file hits a particular size, rather than based on the timestamp." + ] + }, + { + "cell_type": "markdown", + "id": "4cab489e-aa10-4482-a62d-3c106bdb06cc", + "metadata": {}, + "source": [ + "### Log patterns\n", + "\n", + "The content of log files is specified in the [`PatternLayout`](https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout).\n", + "\n", + "Run the following command to see the current Log4J configuration for `PatternLayout` for the \"Console\".\n", + "\n", + "```bash\n", + "xmllint -xpath Configuration/Appenders/Console/PatternLayout ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Each element refers to a specific piece of recordable information:\n", + "\n", + "* Timestamp (`%d{ISO8601}`)\n", + "* Log level (`%p`)\n", + "* Thread name (`[%t]`)\n", + "* Logger name (`%c`)\n", + "* Message (`%m%n`)\n", + "\n", + "Run this command to follow along as log entries are added to the process logs for the Coordinator, Overlord, and Broker.\n", + "\n", + "```bash\n", + "multitail -CS log4jnew -du -P a \\\n", + " -f coordinator-overlord.log \\\n", + " -f broker.log\n", + "```\n", + "\n", + "Developers assign different log levels to different entries in the log, indicating how severe a log entry is. In the log you will see entries with these levels:\n", + "\n", + "* FATAL (system failure)\n", + "* ERROR (functional failure)\n", + "* WARN (non-fatal issue)\n", + "* INFO (notable event)\n", + "* DEBUG (program debugging messages)\n", + "* TRACE (highly granular execution event)\n", + "\n", + "The thread name and logger name are helpful for diagnosis, especially for WARN and ERROR conditions.\n", + "\n", + "Messages describe what has happened, what the state of the process or some significant variable is. These serve not just for diagnosis but for general learning of how the Druid database works.\n", + "\n", + "In your `multitail` window, press `q` to quit and return to the terminal." + ] + }, + { + "cell_type": "markdown", + "id": "df2f17b3-f359-4df8-a682-7398e7d2adf8", + "metadata": {}, + "source": [ + "### Log retention\n", + "\n", + "Since \"FileAppender\" is a [RollingRandomAccessFileAppender](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender) you can adjust the `DefaultRolloverStrategy` to control retention of logs by adjusting the `Delete` section.\n", + "\n", + "`IfFileName` and `IfLastModified` are used in conjunction to remove any files from the log folder that match the rules. The default is to remove matching files older than two months, based on the date last modified.\n", + "\n", + "Run the following command to change the trigger policy to a [duration](https://logging.apache.org/log4j/2.x/javadoc/log4j-core/org/apache/logging/log4j/core/appender/rolling/action/Duration.html#parseCharSequence) of \"PT2M\". As a consequence, any files matching the `IfFileName` filter older than 2 minutes will be deleted automatically by Log4J.\n", + "\n", + "Looking back at your previous `ls -l` results, run this command and repeat to see how the `Delete` rule is applied.\n", + "\n", + "```bash\n", + "sed -i '' 's/IfLastModified age=\"7d\"/IfLastModified age=\"PT2M\"/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Notice how _all_ files are removed by the Overlord, not just rollover logs.\n", + "\n", + "```bash\n", + "ls\n", + "```\n", + "\n", + "Run this command to change the retention policy to one day.\n", + "\n", + "```bash\n", + "sed -i '' 's/IfLastModified age=\"PT2M\"/IfLastModified age=\"P1D\"/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Run the following to return the file pattern to the default.\n", + "\n", + "```bash\n", + "sed -i '' 's/{yyyyMMdd-HH:mm}/{yyyyMMdd}/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "4e6e8716-621c-4c94-90f2-360f568629d1", + "metadata": {}, + "source": [ + "## Loggers\n", + "\n", + "This section of the configuration controls what types of events are logged and what data is recorded.\n", + "\n", + "Before beginning, restart your Druid instance to recreate any logs that were deleted by the retention rules above.\n", + "\n", + "Run this command to kill your Druid instance.\n", + "\n", + "```bash\n", + "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", + "```\n", + "\n", + "Now restart Druid with the following command.\n", + "\n", + "```bash\n", + "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "da8e5519-88d7-4da1-982d-c8ba6fe4e83c", + "metadata": {}, + "source": [ + "### Logging level\n", + "\n", + "The out-of-the-box Druid configuration for Log4J sets a `Root` level for the `FileAppender` of `INFO`, meaning that only messages with a level of `INFO` and above will be recorded.\n", + "\n", + "Run this command to see the configuration for your instance.\n", + "\n", + "```bash\n", + "xmllint -xpath Configuration/Loggers/Root ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Other base levels are set at a class level, reducing log noise. For example:\n", + "\n", + "```xml\n", + " \n", + " \n", + " \n", + " \n", + "```\n", + "\n", + "Run this command to monitor several log files:\n", + "\n", + "```bash\n", + "multitail -CS log4jnew -du -P a -s 2 -sn 1,3 \\\n", + " -f coordinator-overlord.log \\\n", + " -f broker.log \\\n", + " -f middlemanager.log \\\n", + " -f historical.log\n", + "```\n", + "\n", + "Open a new terminal window and run this command to amend the base logging level for all Druid processes:\n", + "\n", + "```bash\n", + "sed -i '' 's/Root level=\"info\"/Root level=\"debug\"/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Within a short period of time you will see many more log messages appearing.\n", + "\n", + "Revert the logging level to INFO before proceeding.\n", + "\n", + "```bash\n", + "sed -i '' 's/Root level=\"debug\"/Root level=\"info\"/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "```\n", + "\n", + "Leave `multitail` running and your second terminal open." + ] + }, + { + "cell_type": "markdown", + "id": "7e0a46b8-77f4-4792-9484-9acba0407b3d", + "metadata": {}, + "source": [ + "## Examples\n", + "\n", + "Two recommended approaches for working with log files are:\n", + "\n", + "1. Search individual log files for indications of a problem, such as WARN, ERROR, and FATAL, and of Java exceptions in stack traces (e.g. `java.net.ConnectionException: Connection refused`.\n", + "2. Read log files more like a novel, starting at the beginning of the file - or some known intermediate point - and then follow the story logically.\n", + "\n", + "In approach 1, it's possible to work back through the history, using the time field as a key. You may notice large time gaps, or use filtering to remove any events from threads or classes that you know are not relevant to the error.\n", + "\n", + "Approach 2 requires more time, but helps to solve more complex problems. It's also an important learning aid when delving deep into how Druid's processes collaborate to realise services.\n", + "\n", + "In this section you'll find examples of log files being useful in diagnosing issues with Druid." + ] + }, + { + "cell_type": "markdown", + "id": "f9137fc3-9f53-4887-a2b2-b8cabb061cc5", + "metadata": {}, + "source": [ + "### Example: failure in Apache Zookeeper\n", + "\n", + "Druid's process rely on [Apache Zookeeper](https://zookeeper.apache.org/) for [inter-process communication and configuration](https://druid.apache.org/docs/latest/dependencies/zookeeper.html). Run the following command to simulate a failure in Zookeeper.\n", + "\n", + "Run the following command in your terminal window.\n", + "* The first `kill` prevents Zookeeper from being recovered by the `learn-druid` script's `supervisor`.\n", + "* The second `kill` stops the Zookeeper process.\n", + "\n", + "```bash\n", + "kill -STOP $(ps -ef | grep 'perl' | awk 'NF{print $2}' | head -n 1)\n", + "kill $(ps -ef | grep 'zoo' | awk 'NF{print $2}' | head -n 1)\n", + "```\n", + "\n", + "In your `multitail` window you will now see multiple `INFO` messages relating to Zookeeper connection issues, such as:\n", + "\n", + "```\n", + "2024-02-22T11:54:31,662 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Session 0x10001277b290002 for server localhost/[0:0:0:0:0:0:0:1]:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.\n", + "\n", + "java.net.ConnectException: Connection refused\n", + "```\n", + "\n", + "Run the command below to restart the \"supervisor\", which will restart Zookeeper automatically.\n", + "\n", + "```bash\n", + "kill -CONT $(ps -ef | grep 'perl' | awk 'NF{print $2}' | head -n 1)\n", + "```\n", + "\n", + "You can use the `-e` switch on `multitail` to filter the logs that are displayed. For example, try closing your `multitail` window with `q` and run the following command to filter logs to only show WARN messages i the Coordinator and Overlord, and to ignore all INFO messages on the Broker, Middle Manager, and Historical. You may then want to repeat the above.\n", + "\n", + "```bash\n", + "multitail -CS log4jnew -du -P a -s 2 -sn 1,3 \\\n", + " -e \"WARN\" -f coordinator-overlord.log \\\n", + " -ev \"INFO\" -f broker.log \\\n", + " -ev \"INFO\" -f middlemanager.log \\\n", + " -ev \"INFO\" -f historical.log\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "1023e5a3-cedc-435f-a448-30f8c3ae1535", + "metadata": {}, + "source": [ + "# Clean up\n", + "\n", + "Run this command to stop Druid.\n", + "\n", + "```bash\n", + "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", + "```\n", + "\n", + "Remove the `learn-druid-local` folder from your home folder in the usual way." + ] + }, + { + "cell_type": "markdown", + "id": "57d5921b-636e-4a85-ae4d-dc0239bfef0b", + "metadata": {}, + "source": [ + "## Learn more\n", + "\n", + "You've seen how Log4J is used by Apache Druid, and about the default configurations.\n", + "\n", + "For more information:\n", + "\n", + "* Watch [Druid optimizations for scaling customer facing analytics at Conviva](https://youtu.be/zkHXr-3GFJw?t=746) by Amir Youssefi and Pawas Ranjan from Conviva that describes how useful this information can be to tuning Druid clusters.\n", + "* Read more about Druid [logging](https://druid.apache.org/docs/latest/configuration/logging.html) in the official documentation.\n", + "* See more ways to use and run `multitail` on the [official site](https://www.vanheusden.com/multitail/index.html).\n", + "* Read more about log4j in [Apache Logging Services](https://logging.apache.org/) documentation.\n", + "* Read your options for `filePattern`s by refering to the [simple date format](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html) documentation.\n", + "* For different `PatternLayout`s refer to the [pattern layout](https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout) documentation." + ] + }, + { + "cell_type": "markdown", + "id": "4c081295-750e-41a3-9693-7fe50af401fa", + "metadata": {}, + "source": [ + "Notice that the Logger Name shows a fully-qualified class name. Here are some examples of searches you may want to conduct on your logs to deepen your knowledge of what your instance is doing.\n", + "\n", + "| Log | Search Term |\n", + "|---|---|\n", + "| Any | __NodeRoleWatcher__
Across all the processes, watch as they detect changes in the processes that are running in the cluster, and see what they do about it! |\n", + "| Any | __org.apache.druid.initialization.Initialization__
These messages are all about the process starting up. It can be interesting to see what exactly each one does – and if it runs into issues. |\n", + "| Coordinator / Overlord | __org.apache.druid.metadata.SQLMetadataRuleManager__
This is the coordinator polling the rules in the metadata database, getting ready to apply them. The log tells you how many rules it picks up and how many data sources they cover. |\n", + "| Coordinator / Overlord | __org.apache.druid.metadata.SqlSegmentsMetadataManager__
Messages show how many segments the cluster thinks are “used” – ready to be used for queries. |\n", + "| Coordinator / Overlord | __org.apache.druid.indexing.overlord.RemoteTaskRunner__
GIves interesting information about what’s happening with ingestion resources in the cluster, including when they first advertise themselves. |\n", + "| Coordinator / Overlord | __org.apache.druid.server.coordinator.rules__
Lots of information about how the retention rules are actually being applied (or not!). |\n", + "| Coordinator / Overlord | __org.apache.druid.server.coordinator.duty.BalanceSegments__
Here you can see what Druid decides to do when balancing needs to be carried out – e.g. a server is lost or added. |\n", + "| Historical | __org.apache.druid.server.coordination.BatchDataSegmentAnnouncer__
You can see individual segments being announced as available for query by each historical server as it loads them. |\n", + "| Historical | __org.apache.druid.server.coordination.SegmentLoadDropHandler__
As well as seeing how the historical checks its local segment cache on startup, you can watch along as the Historical picks up the instructions from the coordinator and then does something about them. When there are ERRORs like “Failed to load segment for dataSource” you get traces about what the process was trying to do – quite often something pointing to an error with its connection to deep storage. |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa902836-0738-4290-b3e9-76ba2c7d53d0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 529a4ea5033e8c094ad8ab5555d43115cec308c5 Mon Sep 17 00:00:00 2001 From: Peter Marshall Date: Fri, 23 Feb 2024 12:50:13 +0000 Subject: [PATCH 2/4] Update 03-druid-metrics.ipynb Added initial sections --- .../05-operations/03-druid-metrics.ipynb | 383 +++--------------- 1 file changed, 66 insertions(+), 317 deletions(-) diff --git a/notebooks/05-operations/03-druid-metrics.ipynb b/notebooks/05-operations/03-druid-metrics.ipynb index 75d8b039..698f78d9 100644 --- a/notebooks/05-operations/03-druid-metrics.ipynb +++ b/notebooks/05-operations/03-druid-metrics.ipynb @@ -41,16 +41,10 @@ "\n", "Open a local terminal window.\n", "\n", - "If you have not already install `wget` or `multitail`, run the following commands to install these tools using `brew`.\n", + "If you have not already install `wget`, run the following commands to install these tools using `brew`.\n", "\n", "```bash\n", - "brew install multitail ; brew install wget\n", - "```\n", - "\n", - "Run the following command to pull the default configuration for `multitail` to your home folder. Do not run this command if you are already running `multitail` as it will overwrite your own configuration.\n", - "\n", - "```bash\n", - "curl https://raw.githubusercontent.com/halturin/multitail/master/multitail.conf > ~/.multitailrc\n", + "brew install wget\n", "```\n", "\n", "### Install Apache Druid\n", @@ -82,29 +76,12 @@ "id": "1d07feb7-92dd-42a9-ad6c-c89d5fdad8d5", "metadata": {}, "source": [ - "# Review the log file configuration\n", - "\n", - "By default, Apache Druid emits JVM metrics from all process.\n", - "\n", - "Metrics configuration is set in the `common.runtime.properties` file.\n", - "\n", - "Remembering that Druid has multiple configuration file locations out-of-the-box, run this command to view the `auto` configuration file for for metrics monitors:\n", - "\n", - "```bash\n", - "grep druid.monitoring ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", - "```\n", - "\n", - "A default [emission period](https://druid.apache.org/docs/latest/configuration/#enabling-metrics) of 1 minute is set in `druid.monitoring.emissionPeriod`. You may change this setting to a different ISO 8601 period to have metrics emitted at a different rate.\n", - "\n", - "The default configuration for Druid is to enable the `JvmMonitor` monitor. Later in this notebook you will enable [other monitors](https://druid.apache.org/docs/latest/configuration/#metrics-monitors). Note that some monitors are designed to work on specific processes, and that enabling monitors on unsupported processes will cause that process to fail during startup.\n", - "\n", - "Run the next command to view metrics emitter configurations:\n", + "# Review the metrics configuration\n", "\n", - "```bash\n", - "grep druid.emitter ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", - "```\n", + "Metrics configuration is set in the `common.runtime.properties` file. This comprises:\n", "\n", - "Notice that, by default, the `druid.emitter` is configured to `noop`, meaning that [no metrics are emitted](https://druid.apache.org/docs/latest/configuration/#metrics-emitters)." + "* [Monitors](https://druid.apache.org/docs/latest/configuration/index.html#metrics-monitors), which extend Druid's built-in metrics.\n", + "* [Emitters](https://druid.apache.org/docs/latest/configuration/index.html#metrics-emitters), which push the metrics to a destination location." ] }, { @@ -112,335 +89,140 @@ "id": "ca62693b-50f4-4831-a092-dbc8e06d8191", "metadata": {}, "source": [ - "## Emitters\n", + "# Emitters\n", "\n", "In this section you will amend the emitter configuration in order to take a look at the JSON objects that contain the data and description for each metric.\n", "\n", - "### Change the emitter module to Logging\n", - "\n", - "Enable the emission of metrics from your instance to the [log files](https://druid.apache.org/docs/latest/configuration/#logging-emitter-module) by running the following command.\n", - "\n", - "```bash\n", - "```\n", - "\n", - "When Druid starts, additional log entries will be created containing the JSON data for each metric.\n", - "\n", - "### Change the metrics emission rate\n", - "\n", - "\n", - "\n", - "### Start a Druid instance\n", - "\n", - "Start Druid with the following command:\n", + "Remembering that Druid has multiple configuration file locations out-of-the-box, run this command to view the `auto` configuration file for emitters:\n", "\n", "```bash\n", - "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", - "```\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "bec19737-2773-4c66-a05e-d0b1a707c4c8", - "metadata": {}, - "source": [ - "## Monitors\n", - "\n", - "\n", - "### Look at the standard log files\n", - "\n", - "Since Druid is a distributed system, we will find log files for each Druid process. In addition, Druid also captures the output written to the standard output. Use the following command to take a look at what is here:\n", - "\n", - "```bash\n", - "cd ~/learn-druid-local/apache-druid/log ; ls\n", + "grep druid.emitter ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", "```\n", "\n", - "You can see how this results in two sets of files being created:\n", + "Notice that, by default, the `druid.emitter` is configured to `noop`, meaning that [no metrics are emitted](https://druid.apache.org/docs/latest/configuration/#metrics-emitters).\n", "\n", - "* `process name.stdout.log` file - which is the information written by the processes to stdout (i.e., the terminal)\n", - "* `process name.log` - file containing various status, error, warning and debug messages" - ] - }, - { - "cell_type": "markdown", - "id": "5258912e-bf23-4acd-9267-a7f4ea3bc6f4", - "metadata": {}, - "source": [ - "### Log filenames\n", + "## Change the emitter to Logging\n", "\n", - "The \"FileAppender\" `RollingRandomAccessFile` appender has both `fileName` and `filePattern` properties. The `fileName` property of the \"FileAppender\" sets the name of the log file being written to at the moment, while `filePattern` is applied when the log rolls over.\n", + "Enable the emission of metrics from your instance to the [log files](https://druid.apache.org/docs/latest/configuration/#logging-emitter-module) by setting the `druid.emitter` to logging.\n", "\n", - "Run the following command to change the default filename of the \"FileAppender\" in the log4j configuration file so that all future detailed process logs will have a name suffixed with the hostname.\n", + "Run the following command to update your configuration.\n", "\n", "```bash\n", - "sed -i '' 's/{sys:druid.node.type}.log/{sys:druid.node.type}-${hostName}.log/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "sed -i '' 's/druid.emitter=noop/druid.emitter=logging/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", "```\n", "\n", - "List the contents of the log folder again to see the changes:\n", + "Additional log entries will be created containing the JSON data for each metric according to the [Logging emitter](https://druid.apache.org/docs/latest/configuration/#logging-emitter-module) configuration. This includes the `druid.emitter.logging.logLevel` of INFO for these entries.\n", "\n", - "```bash\n", - "ls\n", - "```\n", + "## Start a Druid instance\n", "\n", - "Revert the change with the following command:\n", + "Start Druid with the following command:\n", "\n", "```bash\n", - "sed -i '' 's/{sys:druid.node.type}-${hostName}.log/{sys:druid.node.type}.log/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", "```\n", "\n", - "Since \"FileAppender\" is a [RollingRandomAccessFileAppender](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender) you can adjust the `TimeBasedTriggeringPolicy` to change when `fileName` log files are rolled over to `filePattern` log files.\n", - "\n", - "Run the following in your terminal to adjust the `filePattern` to include the hours and minutes.\n", + "## Look at the JSON metrics messages\n", "\n", "```bash\n", - "sed -i '' 's/{yyyyMMdd}/{yyyyMMdd-HH:mm}/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - [{]' ~/learn-druid-local/apache-druid/log/broker.log \\\n", + " | head -n 1 \\\n", + " | awk 'BEGIN {FS = \" - \"} ; {print $2}' \\\n", + " | jq 'select(has(\"type\") | not) | select(.feed == \"metrics\")'\n", "```\n", "\n", - "Run this command to see the new file pattern:\n", - "\n", "```bash\n", - "grep filePattern ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - [{]' apache-druid-24.0.2/log/broker.log \\\n", + " | awk 'BEGIN {FS = \" - \"} ; {print $2}' \\\n", + " | jq 'select(has(\"type\") | not) | select(.feed == \"metrics\") | \"\\(.timestamp) \\(.metric) \\(.value)\"' \\\n", + " | awk '{printf \"%17.17s - %-30s %s\\n\", $1, $2, $3}'\n", "```\n", "\n", - "Since the `TimeBasedTriggeringPolicy` is set to 1 by default, a change in the least granular element of the `filePattern` will trigger a rollover. In this case, you will now see a new file being created every minute.\n", - "\n", - "Run the following command a few times - you will begin to see files being created every minute.\n", + "## Ingest some data\n", "\n", "```bash\n", - "ls -l\n", "```\n", "\n", - "Use other Log4j policies in the usual way. For example, you can adjust the `filePattern` to something like `%d{yyyyMMdd-HH:mm}-%i.log` and then use a [`SizeBasedTriggeringPolicy`](https://logging.apache.org/log4j/2.x/manual/appenders.html#sizebased-triggering-policy) instead of a `TimeBasedTriggeringPolicy` to have log files emitted when a log file hits a particular size, rather than based on the timestamp." - ] - }, - { - "cell_type": "markdown", - "id": "4cab489e-aa10-4482-a62d-3c106bdb06cc", - "metadata": {}, - "source": [ - "### Log patterns\n", - "\n", - "The content of log files is specified in the [`PatternLayout`](https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout).\n", - "\n", - "Run the following command to see the current Log4J configuration for `PatternLayout` for the \"Console\".\n", - "\n", "```bash\n", - "xmllint -xpath Configuration/Appenders/Console/PatternLayout ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - [{]' apache-druid-24.0.2/log/broker.log \\\n", + " | awk 'BEGIN {FS = \" - \"} ; {print $2}' \\\n", + " | jq 'select(has(\"type\") | not) | select(.feed == \"metrics\") | \"\\(.timestamp) \\(.metric) \\(.value)\"' \\\n", + " | awk '{printf \"%17.17s - %-30s %s\\n\", $1, $2, $3}'\n", "```\n", "\n", - "Each element refers to a specific piece of recordable information:\n", - "\n", - "* Timestamp (`%d{ISO8601}`)\n", - "* Log level (`%p`)\n", - "* Thread name (`[%t]`)\n", - "* Logger name (`%c`)\n", - "* Message (`%m%n`)\n", - "\n", - "Run this command to follow along as log entries are added to the process logs for the Coordinator, Overlord, and Broker.\n", + "## Run a query\n", "\n", "```bash\n", - "multitail -CS log4jnew -du -P a \\\n", - " -f coordinator-overlord.log \\\n", - " -f broker.log\n", "```\n", "\n", - "Developers assign different log levels to different entries in the log, indicating how severe a log entry is. In the log you will see entries with these levels:\n", - "\n", - "* FATAL (system failure)\n", - "* ERROR (functional failure)\n", - "* WARN (non-fatal issue)\n", - "* INFO (notable event)\n", - "* DEBUG (program debugging messages)\n", - "* TRACE (highly granular execution event)\n", - "\n", - "The thread name and logger name are helpful for diagnosis, especially for WARN and ERROR conditions.\n", - "\n", - "Messages describe what has happened, what the state of the process or some significant variable is. These serve not just for diagnosis but for general learning of how the Druid database works.\n", - "\n", - "In your `multitail` window, press `q` to quit and return to the terminal." - ] - }, - { - "cell_type": "markdown", - "id": "df2f17b3-f359-4df8-a682-7398e7d2adf8", - "metadata": {}, - "source": [ - "### Log retention\n", - "\n", - "Since \"FileAppender\" is a [RollingRandomAccessFileAppender](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender) you can adjust the `DefaultRolloverStrategy` to control retention of logs by adjusting the `Delete` section.\n", - "\n", - "`IfFileName` and `IfLastModified` are used in conjunction to remove any files from the log folder that match the rules. The default is to remove matching files older than two months, based on the date last modified.\n", - "\n", - "Run the following command to change the trigger policy to a [duration](https://logging.apache.org/log4j/2.x/javadoc/log4j-core/org/apache/logging/log4j/core/appender/rolling/action/Duration.html#parseCharSequence) of \"PT2M\". As a consequence, any files matching the `IfFileName` filter older than 2 minutes will be deleted automatically by Log4J.\n", - "\n", - "Looking back at your previous `ls -l` results, run this command and repeat to see how the `Delete` rule is applied.\n", + "In this step, the jq command in the pipeline filters out all but the timestamp, label (i.e., metric) and value fields. The other fields maybe helpful for deeper dives into metrics, but in this exercise, we just want to get a sense of what changes when we add monitors.\n", "\n", "```bash\n", - "sed -i '' 's/IfLastModified age=\"7d\"/IfLastModified age=\"PT2M\"/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - [{]' apache-druid-24.0.2/log/broker.log \\\n", + " | awk 'BEGIN {FS = \" - \"} ; {print $2}' \\\n", + " | jq 'select(has(\"type\") | not) | select(.feed == \"metrics\") | \"\\(.timestamp) \\(.metric) \\(.value)\"' \\\n", + " | awk '{printf \"%17.17s - %-30s %s\\n\", $1, $2, $3}'\n", "```\n", "\n", - "Notice how _all_ files are removed by the Overlord, not just rollover logs.\n", + "[SQL metrics](https://druid.apache.org/docs/latest/operations/metrics/#sql-metrics) enable you to see the time taken to complete the query, planning time in miliseconds, and number of bytes returned `sqlQuery` metrics being emitted.\n", "\n", - "```bash\n", - "ls\n", - "```\n", - "\n", - "Run this command to change the retention policy to one day.\n", + "## Stop your cluster\n", "\n", "```bash\n", - "sed -i '' 's/IfLastModified age=\"PT2M\"/IfLastModified age=\"P1D\"/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", "```\n", "\n", - "Run the following to return the file pattern to the default.\n", + "## Change the emitter module to Kafka\n", "\n", - "```bash\n", - "sed -i '' 's/{yyyyMMdd-HH:mm}/{yyyyMMdd}/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "4e6e8716-621c-4c94-90f2-360f568629d1", - "metadata": {}, - "source": [ - "## Loggers\n", + "https://druid.apache.org/docs/latest/development/extensions-contrib/kafka-emitter\n", "\n", - "This section of the configuration controls what types of events are logged and what data is recorded.\n", + "## Set up a supervisor for the Kafka metrics\n", "\n", - "Before beginning, restart your Druid instance to recreate any logs that were deleted by the retention rules above.\n", + "## Query the Kafka metrics\n", "\n", - "Run this command to kill your Druid instance.\n", + "## Change the metrics emission rate\n", "\n", "```bash\n", - "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", + "sed -i '' 's/{sys:druid.node.type}.log/{sys:druid.node.type}-${hostName}.log/' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", "```\n", - "\n", - "Now restart Druid with the following command.\n", - "\n", - "```bash\n", - "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", - "```" + "\n" ] }, { "cell_type": "markdown", - "id": "da8e5519-88d7-4da1-982d-c8ba6fe4e83c", + "id": "bec19737-2773-4c66-a05e-d0b1a707c4c8", "metadata": {}, "source": [ - "### Logging level\n", - "\n", - "The out-of-the-box Druid configuration for Log4J sets a `Root` level for the `FileAppender` of `INFO`, meaning that only messages with a level of `INFO` and above will be recorded.\n", - "\n", - "Run this command to see the configuration for your instance.\n", - "\n", - "```bash\n", - "xmllint -xpath Configuration/Loggers/Root ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", - "```\n", - "\n", - "Other base levels are set at a class level, reducing log noise. For example:\n", - "\n", - "```xml\n", - " \n", - " \n", - " \n", - " \n", - "```\n", - "\n", - "Run this command to monitor several log files:\n", + "# Monitors\n", "\n", - "```bash\n", - "multitail -CS log4jnew -du -P a -s 2 -sn 1,3 \\\n", - " -f coordinator-overlord.log \\\n", - " -f broker.log \\\n", - " -f middlemanager.log \\\n", - " -f historical.log\n", - "```\n", - "\n", - "Open a new terminal window and run this command to amend the base logging level for all Druid processes:\n", + "In this section you will add to the list of `druid.monitoring.monitors` to extend the basic metrics into other areas.\n", "\n", - "```bash\n", - "sed -i '' 's/Root level=\"info\"/Root level=\"debug\"/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", - "```\n", - "\n", - "Within a short period of time you will see many more log messages appearing.\n", - "\n", - "Revert the logging level to INFO before proceeding.\n", + "Remembering that Druid has multiple configuration file locations out-of-the-box, run this command to view the `auto` configuration file for for metrics monitors:\n", "\n", "```bash\n", - "sed -i '' 's/Root level=\"debug\"/Root level=\"info\"/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml\n", + "grep druid.monitoring ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", "```\n", "\n", - "Leave `multitail` running and your second terminal open." - ] - }, - { - "cell_type": "markdown", - "id": "7e0a46b8-77f4-4792-9484-9acba0407b3d", - "metadata": {}, - "source": [ - "## Examples\n", - "\n", - "Two recommended approaches for working with log files are:\n", - "\n", - "1. Search individual log files for indications of a problem, such as WARN, ERROR, and FATAL, and of Java exceptions in stack traces (e.g. `java.net.ConnectionException: Connection refused`.\n", - "2. Read log files more like a novel, starting at the beginning of the file - or some known intermediate point - and then follow the story logically.\n", - "\n", - "In approach 1, it's possible to work back through the history, using the time field as a key. You may notice large time gaps, or use filtering to remove any events from threads or classes that you know are not relevant to the error.\n", + "The default configuration for Druid is to enable [JVM metrics](https://druid.apache.org/docs/latest/operations/metrics.html#jvm) from the `JvmMonitor` monitor.\n", "\n", - "Approach 2 requires more time, but helps to solve more complex problems. It's also an important learning aid when delving deep into how Druid's processes collaborate to realise services.\n", - "\n", - "In this section you'll find examples of log files being useful in diagnosing issues with Druid." - ] - }, - { - "cell_type": "markdown", - "id": "f9137fc3-9f53-4887-a2b2-b8cabb061cc5", - "metadata": {}, - "source": [ - "### Example: failure in Apache Zookeeper\n", + "> You may want to experiment by removing the `JvmMonitor` monitor from the default list of monitors. You will then see only the out-of-the-box metrics emitted by every Druid process.\n", "\n", - "Druid's process rely on [Apache Zookeeper](https://zookeeper.apache.org/) for [inter-process communication and configuration](https://druid.apache.org/docs/latest/dependencies/zookeeper.html). Run the following command to simulate a failure in Zookeeper.\n", - "\n", - "Run the following command in your terminal window.\n", - "* The first `kill` prevents Zookeeper from being recovered by the `learn-druid` script's `supervisor`.\n", - "* The second `kill` stops the Zookeeper process.\n", + "## Stop your cluster\n", "\n", "```bash\n", - "kill -STOP $(ps -ef | grep 'perl' | awk 'NF{print $2}' | head -n 1)\n", - "kill $(ps -ef | grep 'zoo' | awk 'NF{print $2}' | head -n 1)\n", - "```\n", - "\n", - "In your `multitail` window you will now see multiple `INFO` messages relating to Zookeeper connection issues, such as:\n", - "\n", + "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", "```\n", - "2024-02-22T11:54:31,662 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Session 0x10001277b290002 for server localhost/[0:0:0:0:0:0:0:1]:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.\n", "\n", - "java.net.ConnectException: Connection refused\n", - "```\n", + "## Add more monitors\n", "\n", - "Run the command below to restart the \"supervisor\", which will restart Zookeeper automatically.\n", + "Note that some monitors are designed to work on specific processes, and that enabling monitors on unsupported processes will cause that process to fail during startup.\n", "\n", - "```bash\n", - "kill -CONT $(ps -ef | grep 'perl' | awk 'NF{print $2}' | head -n 1)\n", - "```\n", + "## Increase the emission period\n", "\n", - "You can use the `-e` switch on `multitail` to filter the logs that are displayed. For example, try closing your `multitail` window with `q` and run the following command to filter logs to only show WARN messages i the Coordinator and Overlord, and to ignore all INFO messages on the Broker, Middle Manager, and Historical. You may then want to repeat the above.\n", + "There is a default [emission period](https://druid.apache.org/docs/latest/configuration/#enabling-metrics) of 1 minute. Add `druid.monitoring.emissionPeriod` to your configuration to have metrics emitted at a different rate.\n", "\n", - "```bash\n", - "multitail -CS log4jnew -du -P a -s 2 -sn 1,3 \\\n", - " -e \"WARN\" -f coordinator-overlord.log \\\n", - " -ev \"INFO\" -f broker.log \\\n", - " -ev \"INFO\" -f middlemanager.log \\\n", - " -ev \"INFO\" -f historical.log\n", - "```" + "## Query the metric data" ] }, { @@ -456,7 +238,7 @@ "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", "```\n", "\n", - "Remove the `learn-druid-local` folder from your home folder in the usual way." + "Delete the `learn-druid-local` folder from your home folder in the usual way." ] }, { @@ -470,41 +252,8 @@ "\n", "For more information:\n", "\n", - "* Watch [Druid optimizations for scaling customer facing analytics at Conviva](https://youtu.be/zkHXr-3GFJw?t=746) by Amir Youssefi and Pawas Ranjan from Conviva that describes how useful this information can be to tuning Druid clusters.\n", - "* Read more about Druid [logging](https://druid.apache.org/docs/latest/configuration/logging.html) in the official documentation.\n", - "* See more ways to use and run `multitail` on the [official site](https://www.vanheusden.com/multitail/index.html).\n", - "* Read more about log4j in [Apache Logging Services](https://logging.apache.org/) documentation.\n", - "* Read your options for `filePattern`s by refering to the [simple date format](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html) documentation.\n", - "* For different `PatternLayout`s refer to the [pattern layout](https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout) documentation." + "* Try out all the other monitors that are available, remembering that some monitors are only applicable to specific processes, requiring you to modify the `runtime.properties` for those processes only." ] - }, - { - "cell_type": "markdown", - "id": "4c081295-750e-41a3-9693-7fe50af401fa", - "metadata": {}, - "source": [ - "Notice that the Logger Name shows a fully-qualified class name. Here are some examples of searches you may want to conduct on your logs to deepen your knowledge of what your instance is doing.\n", - "\n", - "| Log | Search Term |\n", - "|---|---|\n", - "| Any | __NodeRoleWatcher__
Across all the processes, watch as they detect changes in the processes that are running in the cluster, and see what they do about it! |\n", - "| Any | __org.apache.druid.initialization.Initialization__
These messages are all about the process starting up. It can be interesting to see what exactly each one does – and if it runs into issues. |\n", - "| Coordinator / Overlord | __org.apache.druid.metadata.SQLMetadataRuleManager__
This is the coordinator polling the rules in the metadata database, getting ready to apply them. The log tells you how many rules it picks up and how many data sources they cover. |\n", - "| Coordinator / Overlord | __org.apache.druid.metadata.SqlSegmentsMetadataManager__
Messages show how many segments the cluster thinks are “used” – ready to be used for queries. |\n", - "| Coordinator / Overlord | __org.apache.druid.indexing.overlord.RemoteTaskRunner__
GIves interesting information about what’s happening with ingestion resources in the cluster, including when they first advertise themselves. |\n", - "| Coordinator / Overlord | __org.apache.druid.server.coordinator.rules__
Lots of information about how the retention rules are actually being applied (or not!). |\n", - "| Coordinator / Overlord | __org.apache.druid.server.coordinator.duty.BalanceSegments__
Here you can see what Druid decides to do when balancing needs to be carried out – e.g. a server is lost or added. |\n", - "| Historical | __org.apache.druid.server.coordination.BatchDataSegmentAnnouncer__
You can see individual segments being announced as available for query by each historical server as it loads them. |\n", - "| Historical | __org.apache.druid.server.coordination.SegmentLoadDropHandler__
As well as seeing how the historical checks its local segment cache on startup, you can watch along as the Historical picks up the instructions from the coordinator and then does something about them. When there are ERRORs like “Failed to load segment for dataSource” you get traces about what the process was trying to do – quite often something pointing to an error with its connection to deep storage. |" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fa902836-0738-4290-b3e9-76ba2c7d53d0", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { From ccadbb8591771d17fa20c94b16561bcbff8512f4 Mon Sep 17 00:00:00 2001 From: Peter Marshall Date: Fri, 23 Feb 2024 14:38:37 +0000 Subject: [PATCH 3/4] Update 03-druid-metrics.ipynb Version ready for reviww --- .../05-operations/03-druid-metrics.ipynb | 166 +++++++++++------- 1 file changed, 104 insertions(+), 62 deletions(-) diff --git a/notebooks/05-operations/03-druid-metrics.ipynb b/notebooks/05-operations/03-druid-metrics.ipynb index 698f78d9..37cde88c 100644 --- a/notebooks/05-operations/03-druid-metrics.ipynb +++ b/notebooks/05-operations/03-druid-metrics.ipynb @@ -76,7 +76,7 @@ "id": "1d07feb7-92dd-42a9-ad6c-c89d5fdad8d5", "metadata": {}, "source": [ - "# Review the metrics configuration\n", + "# Metrics configuration\n", "\n", "Metrics configuration is set in the `common.runtime.properties` file. This comprises:\n", "\n", @@ -89,7 +89,7 @@ "id": "ca62693b-50f4-4831-a092-dbc8e06d8191", "metadata": {}, "source": [ - "# Emitters\n", + "## Emitters\n", "\n", "In this section you will amend the emitter configuration in order to take a look at the JSON objects that contain the data and description for each metric.\n", "\n", @@ -101,7 +101,7 @@ "\n", "Notice that, by default, the `druid.emitter` is configured to `noop`, meaning that [no metrics are emitted](https://druid.apache.org/docs/latest/configuration/#metrics-emitters).\n", "\n", - "## Change the emitter to Logging\n", + "### Change the emitter to Logging\n", "\n", "Enable the emission of metrics from your instance to the [log files](https://druid.apache.org/docs/latest/configuration/#logging-emitter-module) by setting the `druid.emitter` to logging.\n", "\n", @@ -114,7 +114,7 @@ "\n", "Additional log entries will be created containing the JSON data for each metric according to the [Logging emitter](https://druid.apache.org/docs/latest/configuration/#logging-emitter-module) configuration. This includes the `druid.emitter.logging.logLevel` of INFO for these entries.\n", "\n", - "## Start a Druid instance\n", + "### Start a Druid instance\n", "\n", "Start Druid with the following command:\n", "\n", @@ -122,107 +122,140 @@ "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", "```\n", "\n", - "## Look at the JSON metrics messages\n", + "### Look at the JSON metrics messages\n", + "\n", + "Run the following command to display the JSON being emitted to the log files.\n", + "\n", + "* `grep` finds only lines in log files related to metrics from the `LoggingEmitter`.\n", + "* The `cut` command returns only the 7th field in the data\n", + "* The result is made pretty through `jq`.\n", "\n", "```bash\n", - "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - [{]' ~/learn-druid-local/apache-druid/log/broker.log \\\n", - " | head -n 1 \\\n", - " | awk 'BEGIN {FS = \" - \"} ; {print $2}' \\\n", - " | jq 'select(has(\"type\") | not) | select(.feed == \"metrics\")'\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - \\[metrics\\]' ~/learn-druid-local/apache-druid/log/*.log \\\n", + " | cut -d ' ' -f 7- \\\n", + " | jq\n", "```\n", "\n", + "Run the command a few times to build up a good sample.\n", + "\n", + "You will see:\n", + "\n", + "* A timestamp for the event.\n", + "* The server hostname and type that emitted the metric together with its running version, such as \"druid/broker\" and \"29.0.0\".\n", + "* The metric name, such as \"serverview/init/time\".\n", + "* A value for the metric." + ] + }, + { + "cell_type": "markdown", + "id": "bec19737-2773-4c66-a05e-d0b1a707c4c8", + "metadata": {}, + "source": [ + "## Monitors\n", + "\n", + "Run this command to view the `auto` configuration file for for metrics monitors:\n", + "\n", "```bash\n", - "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - [{]' apache-druid-24.0.2/log/broker.log \\\n", - " | awk 'BEGIN {FS = \" - \"} ; {print $2}' \\\n", - " | jq 'select(has(\"type\") | not) | select(.feed == \"metrics\") | \"\\(.timestamp) \\(.metric) \\(.value)\"' \\\n", - " | awk '{printf \"%17.17s - %-30s %s\\n\", $1, $2, $3}'\n", + "grep druid.monitoring ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", "```\n", "\n", - "## Ingest some data\n", + "### Inspect some metrics\n", + "\n", + "The default configuration for Druid extends the basic metrics with:\n", + "\n", + "* [JVM metrics](https://druid.apache.org/docs/latest/operations/metrics.html#jvm) from the `JvmMonitor` monitor.\n", + "* Service heartbeats from the `ServiceStatusMonitor` monitor.\n", + "\n", + "Use the command below to see a specific JVM metric for your Coordinator process. You may want to run this command a few times to see what is happening.\n", "\n", "```bash\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - \\[metrics\\]' ~/learn-druid-local/apache-druid/log/*.log \\\n", + " | cut -d ' ' -f 7- \\\n", + " | jq 'select(.metric == \"jvm/pool/used\" and .service==\"druid/coordinator\")'\n", "```\n", "\n", + "Notice that this metric has additional dimensions, `poolKind` and `poolName`. Other monitors emit [other dimensions](https://druid.apache.org/docs/latest/operations/metrics).\n", + "\n", + "Run the following command to return your entire instance to the basic metrics for Druid:\n", + "\n", "```bash\n", - "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - [{]' apache-druid-24.0.2/log/broker.log \\\n", - " | awk 'BEGIN {FS = \" - \"} ; {print $2}' \\\n", - " | jq 'select(has(\"type\") | not) | select(.feed == \"metrics\") | \"\\(.timestamp) \\(.metric) \\(.value)\"' \\\n", - " | awk '{printf \"%17.17s - %-30s %s\\n\", $1, $2, $3}'\n", + "sed -i '' 's/\"org.apache.druid.java.util.metrics.JvmMonitor\", \"org.apache.druid.server.metrics.ServiceStatusMonitor\"//' \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", "```\n", "\n", - "## Run a query\n", + "Now restart your instance and - for the purpose of this exercise - clear down your logs.\n", "\n", "```bash\n", + "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", + "rm ~/learn-druid-local/apache-druid/log/*.log\n", + "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", "```\n", "\n", - "In this step, the jq command in the pipeline filters out all but the timestamp, label (i.e., metric) and value fields. The other fields maybe helpful for deeper dives into metrics, but in this exercise, we just want to get a sense of what changes when we add monitors.\n", + "Run this command to see the base metrics that are now being emitted:\n", "\n", "```bash\n", - "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - [{]' apache-druid-24.0.2/log/broker.log \\\n", - " | awk 'BEGIN {FS = \" - \"} ; {print $2}' \\\n", - " | jq 'select(has(\"type\") | not) | select(.feed == \"metrics\") | \"\\(.timestamp) \\(.metric) \\(.value)\"' \\\n", - " | awk '{printf \"%17.17s - %-30s %s\\n\", $1, $2, $3}'\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - \\[metrics\\]' ~/learn-druid-local/apache-druid/log/*.log \\\n", + " | cut -d ' ' -f 7- \\\n", + " | jq\n", "```\n", "\n", - "[SQL metrics](https://druid.apache.org/docs/latest/operations/metrics/#sql-metrics) enable you to see the time taken to complete the query, planning time in miliseconds, and number of bytes returned `sqlQuery` metrics being emitted.\n", + "### Add a process-specific monitor\n", + "\n", + "Some monitors are designed to work on specific processes. Enabling monitors on unsupported processes will cause that process to fail during startup. In this section you will add the [Historical](https://druid.apache.org/docs/latest/operations/metrics/#historical-1) monitor.\n", "\n", - "## Stop your cluster\n", + "First, stop your cluster with the following command:\n", "\n", "```bash\n", "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", "```\n", "\n", - "## Change the emitter module to Kafka\n", + "Now run this command to add the Historical monitor to your Historical process's `runtime.properties`.\n", "\n", - "https://druid.apache.org/docs/latest/development/extensions-contrib/kafka-emitter\n", - "\n", - "## Set up a supervisor for the Kafka metrics\n", - "\n", - "## Query the Kafka metrics\n", + "```bash\n", + "echo \"druid.monitoring.monitors=[\\\"org.apache.druid.server.metrics.HistoricalMetricsMonitor\\\"]\" >> \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/historical/runtime.properties\n", + "```\n", "\n", - "## Change the metrics emission rate\n", + "Start Druid with the following command:\n", "\n", "```bash\n", - "sed -i '' 's/{sys:druid.node.type}.log/{sys:druid.node.type}-${hostName}.log/' \\\n", - " ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", + "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", "```\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "bec19737-2773-4c66-a05e-d0b1a707c4c8", - "metadata": {}, - "source": [ - "# Monitors\n", - "\n", - "In this section you will add to the list of `druid.monitoring.monitors` to extend the basic metrics into other areas.\n", "\n", - "Remembering that Druid has multiple configuration file locations out-of-the-box, run this command to view the `auto` configuration file for for metrics monitors:\n", + "Now run this command to review some of the metrics data from the Historical:\n", "\n", "```bash\n", - "grep druid.monitoring ~/learn-druid-local/apache-druid/conf/druid/auto/_common/common.runtime.properties\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - \\[metrics\\]' ~/learn-druid-local/apache-druid/log/historical.log \\\n", + " | cut -d ' ' -f 7- \\\n", + " | jq\n", "```\n", "\n", - "The default configuration for Druid is to enable [JVM metrics](https://druid.apache.org/docs/latest/operations/metrics.html#jvm) from the `JvmMonitor` monitor.\n", + "### Increase the emission period\n", "\n", - "> You may want to experiment by removing the `JvmMonitor` monitor from the default list of monitors. You will then see only the out-of-the-box metrics emitted by every Druid process.\n", + "There is a default [emission period](https://druid.apache.org/docs/latest/configuration/#enabling-metrics) of 1 minute. Apply a `druid.monitoring.emissionPeriod` to your configuration to have metrics emitted at a different rate.\n", "\n", - "## Stop your cluster\n", + "Run this command to have the Historical process emit metrics every 15 seconds:\n", "\n", "```bash\n", - "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", + "echo \"druid.monitoring.emissionPeriod=PT15S\" >> \\\n", + " ~/learn-druid-local/apache-druid/conf/druid/auto/historical/runtime.properties\n", "```\n", "\n", - "## Add more monitors\n", - "\n", - "Note that some monitors are designed to work on specific processes, and that enabling monitors on unsupported processes will cause that process to fail during startup.\n", + "To apply the configuration, restart your instance:\n", "\n", - "## Increase the emission period\n", + "```bash\n", + "kill $(ps -ef | grep 'supervise' | awk 'NF{print $2}' | head -n 1)\n", + "rm ~/learn-druid-local/apache-druid/log/*.log\n", + "nohup ~/learn-druid-local/apache-druid/bin/start-druid & disown > log.out 2> log.err < /dev/null\n", + "```\n", "\n", - "There is a default [emission period](https://druid.apache.org/docs/latest/configuration/#enabling-metrics) of 1 minute. Add `druid.monitoring.emissionPeriod` to your configuration to have metrics emitted at a different rate.\n", + "Run this command to see the metrics as they are being emitted. For ease of reading, the `jq` portion of this command only selects the timestamp, metric name, and its value.\n", "\n", - "## Query the metric data" + "```bash\n", + "grep 'org.apache.druid.java.util.emitter.core.LoggingEmitter - \\[metrics\\]' ~/learn-druid-local/apache-druid/log/historical.log \\\n", + " | cut -d ' ' -f 7- \\\n", + " | jq '\"\\(.timestamp) \\(.metric) \\(.value)\"'\n", + "```" ] }, { @@ -248,12 +281,21 @@ "source": [ "## Learn more\n", "\n", - "You've seen how Log4J is used by Apache Druid, and about the default configurations.\n", - "\n", - "For more information:\n", + "You've seen how the two components of Druid's configuration for metrics are controlled - through monitors and through emitters - and that you can configure these either at the cluster level or for individual processes.\n", "\n", - "* Try out all the other monitors that are available, remembering that some monitors are only applicable to specific processes, requiring you to modify the `runtime.properties` for those processes only." + "* Read about [monitors](https://druid.apache.org/docs/latest/configuration/index.html#metrics-monitors) and [emitters](https://druid.apache.org/docs/latest/configuration/index.html#metrics-emitters) in the official documentation.\n", + "* Try out all the other monitors that are available, remembering that some monitors are only applicable to specific processes, requiring you to modify the `runtime.properties` for those processes only.\n", + "* Try out some of the [core emitters](https://druid.apache.org/docs/latest/configuration/#metrics-emitters) that are available as well as those available as [community extensions](https://druid.apache.org/docs/latest/configuration/extensions/#community-extensions), such as the [Apache Kafka](https://druid.apache.org/docs/latest/development/extensions-contrib/kafka-emitter), [statsd](https://druid.apache.org/docs/latest/development/extensions-contrib/statsd), and [Prometheus](https://druid.apache.org/docs/latest/development/extensions-contrib/prometheus) emitters.\n", + "* Experiment by using the Kafka emitter to push your instance's own metrics into a topic that you then consume back into the cluster with real-time ingestion." ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0f5f8da-a874-451f-9cf0-771202142fcf", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { From 99db02c93cc44ed5e353c4ea68c09d9581580005 Mon Sep 17 00:00:00 2001 From: Peter Marshall Date: Thu, 29 Feb 2024 15:20:24 +0000 Subject: [PATCH 4/4] Update 03-druid-metrics.ipynb Added suggestions from @ektravel , plus adding prereq to be using a Mac. --- notebooks/05-operations/03-druid-metrics.ipynb | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/notebooks/05-operations/03-druid-metrics.ipynb b/notebooks/05-operations/03-druid-metrics.ipynb index 37cde88c..33304da5 100644 --- a/notebooks/05-operations/03-druid-metrics.ipynb +++ b/notebooks/05-operations/03-druid-metrics.ipynb @@ -19,7 +19,7 @@ "source": [ "## Prerequisites\n", "\n", - "This tutorial works with Druid 29.0.0 or later.\n", + "This tutorial works with Druid 29.0.0 or later, and is designed to be run on a Mac.\n", "\n", "Launch this tutorial and all prerequisites using the `jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).\n", "\n", @@ -33,18 +33,24 @@ "source": [ "## Initialization\n", "\n", - "To use this notebook, you will need to run Druid locally.\n", + "To use this notebook, you must have Druid running locally.\n", "\n", "You will also make extensive use of the terminal, which you can place alongside this notebook or on another screen.\n", "\n", - "### Install tools\n", + "### Install required tools\n", "\n", "Open a local terminal window.\n", "\n", - "If you have not already install `wget`, run the following commands to install these tools using `brew`.\n", + "If you haven't installed `wget` or `multitail` yet, run the following commands to install them using `brew`.\n", "\n", "```bash\n", - "brew install wget\n", + "brew install multitail ; brew install wget\n", + "```\n", + "\n", + "To fetch the default configuration for `multitail` to your home folder, execute the following command. Skip this step if you are already running `multitail` as it will overwrite your own configuration.\n", + "\n", + "```bash\n", + "curl https://raw.githubusercontent.com/halturin/multitail/master/multitail.conf > ~/.multitailrc\n", "```\n", "\n", "### Install Apache Druid\n", @@ -56,7 +62,7 @@ "cd learn-druid-local\n", "```\n", "\n", - "Now pull and decompress a compatible version of Druid.\n", + "Pull and extract a compatible version of Druid.\n", "\n", "```bash\n", "wget https://dlcdn.apache.org/druid/29.0.0/apache-druid-29.0.0-bin.tar.gz\n",