diff --git a/.github/README.md b/.github/README.md index 00d03c0633..ea7c06fbc5 100644 --- a/.github/README.md +++ b/.github/README.md @@ -15,7 +15,7 @@ The following guide describes how to setup the OpenTelemetry demo with OpenSearc ```bash git clone https://github.com/opensearch/opentelemetry-demo.git cd opentelemetry-demo -docker-compose up -d +docker compose up -d ``` ### Services @@ -33,22 +33,25 @@ OpenSearch has [documented](https://opensearch.org/docs/latest/observing-your-da The next instructions are similar and use the same docker compose file. 1. Start the demo with the following command from the repository's root directory: ``` - docker-compose up -d + docker compose up -d ``` -**Note:** The docker-compose `--no-build` flag is used to fetch released docker images from [ghcr](http://ghcr.io/open-telemetry/demo) instead of building from source. +**Note:** The docker compose `--no-build` flag is used to fetch released docker images from [ghcr](http://ghcr.io/open-telemetry/demo) instead of building from source. Removing the `--no-build` command line option will rebuild all images from source. It may take more than 20 minutes to build if the flag is omitted. ### Explore and analyze the data With OpenSearch Observability Review revised OpenSearch [Observability Architecture](architecture.md) -### Service map +### Start learning OpenSearch Observability using our tutorial +[Getting started Tutorial](../tutorial/README.md) + +#### Service map ![Service map](https://docs.aws.amazon.com/images/opensearch-service/latest/developerguide/images/ta-dashboards-services.png) -### Traces +#### Traces ![Traces](https://opensearch.org/docs/2.6/images/ta-trace.png) -### Correlation +#### Correlation ![Correlation](https://opensearch.org/docs/latest/images/observability-trace.png) -### Logs +#### Logs ![Logs](https://opensearch.org/docs/latest/images/trace_log_correlation.gif) \ No newline at end of file diff --git a/.github/architecture.md b/.github/architecture.md index 78733a48bc..d6ad7669d9 100644 --- a/.github/architecture.md +++ b/.github/architecture.md @@ -46,6 +46,11 @@ Backend supportive services - See [description](../src/featureflagservice/README.md) - [Grafana](http://grafana:3000) - See [description](https://github.com/YANG-DB/opentelemetry-demo/blob/12d52cbb23bbf4226f6de2dfec840482a0a7d054/docker-compose.yml#L637) + +### Services Topology +The next diagram shows the docker compose services dependencies + +![](img/docker-services-topology.png) --- ## Purpose diff --git a/.github/img/DemoFlow.png b/.github/img/DemoFlow.png new file mode 100644 index 0000000000..ddd45b41b6 Binary files /dev/null and b/.github/img/DemoFlow.png differ diff --git a/.github/img/docker-services-topology.png b/.github/img/docker-services-topology.png new file mode 100644 index 0000000000..8e3db8fca9 Binary files /dev/null and b/.github/img/docker-services-topology.png differ diff --git a/docker-compose.yml b/docker-compose.yml index eb923fd429..8abcc62c87 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -297,6 +297,7 @@ services: - OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE - WEB_OTEL_SERVICE_NAME=frontend-web depends_on: + - accountingservice - adservice - cartservice - checkoutservice @@ -771,6 +772,8 @@ services: OPENSEARCH_HOSTS: '["https://opensearch-node1:9200","https://opensearch-node2:9200"]' # Define the OpenSearch nodes that OpenSearch Dashboards will query depends_on: - opensearch-node1 + - opensearch-node2 + - prometheus # Observability OSD Integrations integrations: diff --git 
a/src/currencyservice/README.md b/src/currencyservice/README.md index 27a23f397f..be2951d9ea 100644 --- a/src/currencyservice/README.md +++ b/src/currencyservice/README.md @@ -9,7 +9,7 @@ To build the currency service, run the following from root directory of opentelemetry-demo ```sh -docker-compose build currencyservice +docker compose build currencyservice ``` ## Run the service @@ -17,7 +17,7 @@ docker-compose build currencyservice Execute the below command to run the service. ```sh -docker-compose up currencyservice +docker compose up currencyservice ``` ## Run the client diff --git a/src/integrations/display/demo-landing-page.ndjson b/src/integrations/display/demo-landing-page.ndjson new file mode 100644 index 0000000000..6b751dac02 --- /dev/null +++ b/src/integrations/display/demo-landing-page.ndjson @@ -0,0 +1,3 @@ +{"attributes":{"description":"OTEL demo landing page","kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[]}"},"title":"OTEL demo landing page","uiStateJSON":"{}","version":1,"visState":"{\"title\":\"OTEL demo landing page\",\"type\":\"markdown\",\"aggs\":[],\"params\":{\"fontSize\":12,\"openLinksInNewTab\":false,\"markdown\":\"\\n![](https://raw.githubusercontent.com/opensearch-project/.github/main/profile/banner.jpg)\\n# OpenSearch Observability OTEL Demo\\n\\nWelcome to the [OpenSearch](https://opensearch.org/docs/latest) OpenTelemetry [Demo](https://opentelemetry.io/docs/demo/) documentation, which covers how to install and run the demo, and some scenarios you can use to view OpenTelemetry in action.\\n\\n## Purpose\\nThe purpose of this demo is to demonstrate the different capabilities of OpenSearch Observability to investigate and reflect your system.\\n\\n![](../../../.github/img/DemoFlow.png)\\n\\n### Services\\n[OTEL DEMO](https://opentelemetry.io/docs/demo/services/) Describes the list of services that are composing the Astronomy Shop.\\n\\nThe main services that are open to user interactions:\\n\\n- [Dashboards](https://observability.playground.opensearch.org/)\\n\\n- [Demo Proxy](https://observability.playground.demo-proxy.opensearch.org/)\\n\\n- [Demo loader](https://observability.playground.demo-loader.opensearch.org/)\\n\\n- [Demo feature-flag](https://observability.playground.demo-feature-flag.opensearch.org/)\\n\\n### Screenshots\\n![](https://opentelemetry.io/docs/demo/screenshots/frontend-1.png)\\n\\n_**The shopping App**_\\n![](https://opentelemetry.io/docs/demo/screenshots/frontend-2.png)\\n\\n_**The feature flag**_\\n![](https://opentelemetry.io/docs/demo/screenshots/feature-flag-ui.png)\\n\\n_**The load generator**_\\n![](https://opentelemetry.io/docs/demo/screenshots/load-generator-ui.png)\\n\\n---\\n### Ingestion\\nThe ingestion capabilities for OpenSearch is to be able to support multiple pipelines:\\n- [Data-Prepper](https://github.com/opensearch-project/data-prepper/) is an OpenSearch ingestion project that allows ingestion of OTEL standard signals using Otel-Collector\\n- [Jaeger](https://opensearch.org/docs/latest/observing-your-data/trace/trace-analytics-jaeger/) is an ingestion framework which has a build in capability for pushing OTEL signals into OpenSearch\\n- [Fluent-Bit](https://docs.fluentbit.io/manual/pipeline/outputs/opensearch) is an ingestion framework which has a build in capability for pushing OTEL signals into OpenSearch\\n\\n### Integrations -\\nThe integration service is a list of pre-canned assets that are loaded in a combined manner to allow users the ability for 
simple and automatic way to discover and review their services topology.\\n\\nThese (demo-sample) integrations contain the following assets:\\n- components & index template mapping\\n- datasources\\n- data-stream & indices\\n- queries\\n- dashboards\\n \\n\"}}"},"id":"dd4bebe0-f66a-11ed-9518-f5d5eb1d70bf","migrationVersion":{"visualization":"7.10.0"},"references":[],"type":"visualization","updated_at":"2023-05-19T17:30:35.804Z","version":"WzM0LDJd"} +{"attributes":{"description":"OTEL demo landing page","hits":0,"kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"language\":\"kuery\",\"query\":\"\"},\"filter\":[]}"},"optionsJSON":"{\"hidePanelTitles\":false,\"useMargins\":true}","panelsJSON":"[{\"version\":\"2.7.0\",\"gridData\":{\"x\":0,\"y\":0,\"w\":24,\"h\":15,\"i\":\"0e0c418a-81f8-4d85-8ba7-8d8ef6e2b1d7\"},\"panelIndex\":\"0e0c418a-81f8-4d85-8ba7-8d8ef6e2b1d7\",\"embeddableConfig\":{},\"panelRefName\":\"panel_0\"}]","timeRestore":false,"title":"OTEL demo landing page","version":1},"id":"e66e2da0-f66a-11ed-9518-f5d5eb1d70bf","migrationVersion":{"dashboard":"7.9.3"},"references":[{"id":"dd4bebe0-f66a-11ed-9518-f5d5eb1d70bf","name":"panel_0","type":"visualization"}],"type":"dashboard","updated_at":"2023-05-19T17:30:51.130Z","version":"WzM1LDJd"} +{"exportedCount":2,"missingRefCount":0,"missingReferences":[]} \ No newline at end of file diff --git a/src/integrations/display/memory-leak-tutorial.ndjson b/src/integrations/display/memory-leak-tutorial.ndjson new file mode 100644 index 0000000000..0820a2a3ca --- /dev/null +++ b/src/integrations/display/memory-leak-tutorial.ndjson @@ -0,0 +1,3 @@ +{"attributes":{"description":"this Pattern present a memory leak diagnostic procedure tutorial","kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[]}"},"title":"mem-leak-diagnostic","uiStateJSON":"{}","version":1,"visState":"{\"title\":\"mem-leak-diagnostic\",\"type\":\"markdown\",\"aggs\":[],\"params\":{\"fontSize\":12,\"openLinksInNewTab\":false,\"markdown\":\"# Memory Leak Investigation Tutorial\\n\\n## Tutorial Definition\\n\\nThe following tutorial describes Using Metrics and Traces to diagnose a memory leak\\nApplication telemetry, such as the kind that OpenTelemetry can provide, is very useful for diagnosing issues in a\\ndistributed system. In this scenario, we will walk through a scenario demonstrating how to move from high-level metrics\\nand traces to determine the cause of a memory leak.\\n\\n## Diagnosis\\n\\nThe first step in diagnosing a problem is to determine that a problem exists. Often the first stop will be a metrics\\ndashboard provided by a tool such as metrics analytics under open search observability.\\n\\n## Dashboards\\n\\nThis tutorial contains the OTEL demo dashboards with a number of charts:\\n\\n- Recommendation Service (CPU% and Memory)\\n- Service Latency (from SpanMetrics)\\n- Error Rate\\n\\nRecommendation Service charts are generated from OpenTelemetry Metrics exported to Prometheus, while the Service Latency\\nand Error Rate charts are generated through the OpenTelemetry Collector Span Metrics processor.\\n\\nFrom our dashboard, we can see that there seems to be anomalous behavior in the recommendation service – spiky CPU\\nutilization, as well as long tail latency in our p95, 99, and 99.9 histograms. 
We can also see that there are\\nintermittent spikes in the memory utilization of this service.\\nWe know that we’re emitting trace data from our application as well, so let’s think about another way that we’d be able\\nto determine that a problem exist.\\n\\n### Traces exploration\\n\\nOpenSearch Observability Trace analytics allows us to search for traces and display the end-to-end latency of an entire\\nrequest with visibility into each individual part of the overall request. Perhaps we noticed an increase in tail latency\\non our frontend requests. Traces dashboard allows us to then search and filter our traces to include only those that\\ninclude requests to recommendation service.\\n\\nBy sorting by latency, we’re able to quickly find specific traces that took a long time. Clicking on a trace in the\\nright panel, we’re able to view the waterfall view.\\nWe can see that the recommendation service is taking a long time to complete its work, and viewing the details allows us\\nto get a better idea of what’s going on.\\n\\n### Confirming the Diagnosis\\n\\nWe can see in our waterfall view that the app.cache_hit attribute is set to false, and that the `app.products.count` value\\nis extremely high.\\n\\nReturning to the search UI, filter to `recommendationservice` in the Service dropdown, and search for app.cache_hit=true\\nin the Tags box.\\n\\nNotice that requests tend to be faster when the cache is hit. Now search for `app.cache_hit=false` and compare the\\nlatency.\\n\\nYou should notice some changes in the visualization at the top of the trace list.\\n\\nNow, since this is a contrived scenario, we know where to find the underlying bug in our code. However, in a real-world\\nscenario, we may need to perform further searching to find out what’s going on in our code, or the interactions between\\nservices that cause it.\\n\\n### SOP flow context aware\\n\\nThe next diagram shows the context aware phases within this SOP.\\n\\nThe user can be shown the summary of the flow for solving his issue and in addition can focus on the actual step he is\\ncurrently performing.\\n\\nThe overall process is mapped into a **state machine** in-which each step has a state with a **transition**.\\n\\nWhen user goes into a different **scope** (`time based` ,`service based`, `log based`) this is defined as a indexed Context (`Ctx[1]`,`Ctx[2]`,...)\\n\\n---\\n\\nThis sequence outlines a process for investigating memory leaks that begins with gathering service data from both Prometheus and OpenSearch. Upon combining and reviewing latency of these services, an anomaly detection leads to a review of service traces, followed by log correlation, log fetching, and eventually an overlay of logs to highlight differences.\\n\\n```mermaid\\n Info[Memory Leak Investigation]\\n |\\n V\\nGet All Services --> Query?[Prometheus]\\n | |\\n | V\\n |--> Query?[OpenSearch]\\n | |\\n V V\\nCombine --> Review[Services Latency]\\n |\\n V\\nIdentify Anomaly --> Query?[Service@traces]\\n | |\\n | V\\n |--> Time Based --> Review[Services traces]\\n | |\\n V V\\nWhats Next? --> Suggest[Correlation with logs]\\n | |\\n | V\\n |--> Fetch Logs --> Review[logs]\\n | |\\n V V\\nWhats Next? 
--> Suggest[logs overlay]\\n | |\\n | V\\n |--> Fetch Logs --> Review[logs diff]\\n | |\\n V V\\nEnd <------------------ End\\n\\n```\\n\"}}"},"id":"92546710-f751-11ed-b6d0-850581e4a72d","migrationVersion":{"visualization":"7.10.0"},"references":[],"type":"visualization","updated_at":"2023-05-20T21:02:03.776Z","version":"WzUxLDVd"} +{"attributes":{"description":"this Pattern present a memory leak diagnostic procedure tutorial","hits":0,"kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"language\":\"kuery\",\"query\":\"\"},\"filter\":[]}"},"optionsJSON":"{\"hidePanelTitles\":false,\"useMargins\":true}","panelsJSON":"[{\"version\":\"2.7.0\",\"gridData\":{\"x\":0,\"y\":0,\"w\":24,\"h\":15,\"i\":\"a1954dc7-8655-4ea8-9a75-67cbe201b80c\"},\"panelIndex\":\"a1954dc7-8655-4ea8-9a75-67cbe201b80c\",\"embeddableConfig\":{},\"panelRefName\":\"panel_0\"}]","timeRestore":false,"title":"mem-leak-dignostic","version":1},"id":"9aa66080-f751-11ed-b6d0-850581e4a72d","migrationVersion":{"dashboard":"7.9.3"},"references":[{"id":"92546710-f751-11ed-b6d0-850581e4a72d","name":"panel_0","type":"visualization"}],"type":"dashboard","updated_at":"2023-05-20T21:02:17.736Z","version":"WzUyLDVd"} +{"exportedCount":2,"missingRefCount":0,"missingReferences":[]} \ No newline at end of file diff --git a/src/integrations/display/otel-architecture.ndjson b/src/integrations/display/otel-architecture.ndjson new file mode 100644 index 0000000000..1cfcd0b4ec --- /dev/null +++ b/src/integrations/display/otel-architecture.ndjson @@ -0,0 +1,3 @@ +{"attributes":{"description":"OTEL Astronomy Demo Application architecture","kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[]}"},"title":"otel-architecture","uiStateJSON":"{}","version":1,"visState":"{\"title\":\"otel-architecture\",\"type\":\"markdown\",\"aggs\":[],\"params\":{\"fontSize\":12,\"openLinksInNewTab\":false,\"markdown\":\"# OTEL Astronomy Demo Application\\n\\nThe following diagram presents the OTEL Astronomy shop services architecture:\\n\\n![](img/DemoServices.png)\\n\\n\\n### Trace Collectors\\nGaining a macro-level perspective on incoming data, such as sample counts and cardinality, is essential for comprehending the collector’s internal dynamics. However, when delving into the details, the interconnections can become complex. The Collector Data Flow Dashboard aims to demonstrate the capabilities of the OpenTelemetry demo application, offering a solid foundation for users to build upon.\\n\\nMonitoring data flow through the OpenTelemetry Collector is crucial for several reasons.\\n - All services are traces in all the development languages\\n - Auto instrumented\\n - Manual spans and attributes\\n - Span events\\n - Span links\\n\\nTrace Headers are propagated across all services (**Context propagation**)\\n\\n\\n### Metric Collectors\\nCollecting all the KPI information into Prometheus time series storage including:\\n - runtime metrics\\n - HTTP / gRPC latency distribution\\n\\n### Data Flow Overview\\nCollector Data Flow Dashboard provides valuable guidance on which metrics to monitor. Users can tailor their own dashboard variations by adding necessary metrics specific to their use cases, such as memory_delimiter processor or other data flow indicators. 
This demo dashboard serves as a starting point, enabling users to explore diverse usage scenarios and adapt the tool to their unique monitoring needs.\\n\\nThe diagram below provides an overview of the system components, showcasing the configuration derived from the OpenTelemetry Collector (otelcol) configuration file utilized by the OpenTelemetry demo application. Additionally, it highlights the observability data (traces and metrics) flow within the system.\\n\\n#### Simple purchase use case\\nThis flow diagram shows the trace evolution from the user selecting a purchased item going through different backend services until reaching the storage database.\\n![](img/DemoFlow.png)\\n\\n## Reference\\n**_OTEL Demo info_**\\n- [architecture](https://opentelemetry.io/docs/demo/architecture/)\\n- [collector-data-flow-dashboard](https://opentelemetry.io/docs/demo/collector-data-flow-dashboard/)\\n- [services](https://opentelemetry.io/docs/demo/services/)\\n\\n**_OTEL Demo youtubes_**\\n - [Cloud Native Live: OpenTelemetry community demo](https://www.youtube.com/watch?v=kD0EAjly9jc)\\n\"}}"},"id":"5c297aa0-f750-11ed-b6d0-850581e4a72d","migrationVersion":{"visualization":"7.10.0"},"references":[],"type":"visualization","updated_at":"2023-05-20T20:53:23.402Z","version":"WzQ3LDVd"} +{"attributes":{"description":"OTEL Astronomy Demo Application architecture","hits":0,"kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"language\":\"kuery\",\"query\":\"\"},\"filter\":[]}"},"optionsJSON":"{\"hidePanelTitles\":false,\"useMargins\":true}","panelsJSON":"[{\"version\":\"2.7.0\",\"gridData\":{\"x\":0,\"y\":0,\"w\":24,\"h\":15,\"i\":\"cc8d389f-6ab0-4590-bd7d-140ed04a28b1\"},\"panelIndex\":\"cc8d389f-6ab0-4590-bd7d-140ed04a28b1\",\"embeddableConfig\":{},\"panelRefName\":\"panel_0\"}]","timeRestore":false,"title":"otel-demo-architecture","version":1},"id":"67e37e40-f750-11ed-b6d0-850581e4a72d","migrationVersion":{"dashboard":"7.9.3"},"references":[{"id":"5c297aa0-f750-11ed-b6d0-850581e4a72d","name":"panel_0","type":"visualization"}],"type":"dashboard","updated_at":"2023-05-20T20:53:43.076Z","version":"WzQ4LDVd"} +{"exportedCount":2,"missingRefCount":0,"missingReferences":[]} \ No newline at end of file diff --git a/src/integrations/display/tutorial-main-page.ndjson b/src/integrations/display/tutorial-main-page.ndjson new file mode 100644 index 0000000000..456a6f3f2e --- /dev/null +++ b/src/integrations/display/tutorial-main-page.ndjson @@ -0,0 +1,3 @@ +{"attributes":{"description":"Observability OTEL demo tutorial main page","kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[]}"},"title":"tutorial-main","uiStateJSON":"{}","version":1,"visState":"{\"title\":\"tutorial-main\",\"type\":\"markdown\",\"aggs\":[],\"params\":{\"fontSize\":12,\"openLinksInNewTab\":false,\"markdown\":\"# OpenSearch Observability Tutorial\\n\\nWelcome to the OpenSearch Observability tutorials! \\n\\nThis tutorial is designed to guide users in the Observability domain through the process of using the OpenSearch Observability plugin. By the end of this tutorial, you will be familiar with building dashboards, creating Pipe Processing Language (PPL) queries, federating metrics from Prometheus data sources, and conducting root cause analysis investigations on your data.\\n\\n## Overview\\n\\nThis tutorial uses the OpenTelemetry demo application, an e-commerce application for an astronomy shop. 
The application includes multiple microservices, each providing different functionalities. These services are monitored and traced using the OpenTelemetry trace collector and additional agents.\\n\\nThe resulting traces and logs are stored in structured indices in OpenSearch indices, following the OpenTelemetry format. \\n\\nThis provides a realistic environment for learning and applying Observability concepts, investigation and diagnostic patterns.\\n\\n## Content\\n\\nThis tutorial is structured as follows:\\n\\n1. **Introduction to the OTEL demo infrastructure & Architecture**: An introduction to OTEL demo architecture and services, how they are monitored, traces and collected.\\n\\n2. **Introduction to OpenSearch Observability**: A brief introduction to the plugin, its features, and its advantages.\\n\\n3. **Building Dashboards**: Step-by-step guide on how to create effective and informative dashboards in OpenSearch Observability.\\n\\n4. **Creating PPL Queries**: Learn how to create PPL queries to extract valuable insights from your data.\\n\\n5. **Federating Metrics from Prometheus**: Detailed guide on how to federate metrics from a Prometheus data source into OpenSearch Observability.\\n\\n6. **Conducting Root Cause Analysis**: Learn how to use the built-in features of OpenSearch Observability to conduct a root cause analysis investigation on your data.\\n\\n7. **OpenTelemetry Integration**: Learn how the OpenTelemetry demo application sends data to OpenSearch and how to navigate and understand this data in OpenSearch Observability.\\n\\nThis tutorial would enhance your understanding of Observability and your ability to use OpenSearch Observability to its fullest.\\n\\n**_Enjoy the learning journey!_**\\n\\n## Prerequisites\\n\\nTo get the most out of this tutorial, you should have a basic understanding of Observability, microservice architectures, and the OpenTelemetry ecosystem.\\n\\n## Getting Started\\n\\nTo start the tutorial, navigate to the `Introduction to OpenSearch Observability` section.\\n\\nHappy Learning!\\n\\n---\\n\\n#### 1. [OTEL Demo Architecture](OTEL Demo Architecture.md) \\n\\n#### 2. [Observability Introduction](Observability Introduction.md) \\n\\n#### 3. 
[Memory Leak Investigation Tutorial](Memory Leak Tutorial.md) \\n\\n\\n---\\n## References\\n\\n[Cloud Native OpenTelemetry community you-tube lecture](https://www.youtube.com/watch?v=kD0EAjly9jc)\"}}"},"id":"e8b42d80-f750-11ed-b6d0-850581e4a72d","migrationVersion":{"visualization":"7.10.0"},"references":[],"type":"visualization","updated_at":"2023-05-20T20:57:19.192Z","version":"WzQ5LDVd"} +{"attributes":{"description":"Observability OTEL demo tutorial main page","hits":0,"kibanaSavedObjectMeta":{"searchSourceJSON":"{\"query\":{\"language\":\"kuery\",\"query\":\"\"},\"filter\":[]}"},"optionsJSON":"{\"hidePanelTitles\":false,\"useMargins\":true}","panelsJSON":"[{\"version\":\"2.7.0\",\"gridData\":{\"x\":0,\"y\":0,\"w\":24,\"h\":15,\"i\":\"0477cef5-8f99-4c83-afae-8e5bbf4dc89e\"},\"panelIndex\":\"0477cef5-8f99-4c83-afae-8e5bbf4dc89e\",\"embeddableConfig\":{},\"panelRefName\":\"panel_0\"}]","timeRestore":false,"title":"tutorial-main","version":1},"id":"f0805520-f750-11ed-b6d0-850581e4a72d","migrationVersion":{"dashboard":"7.9.3"},"references":[{"id":"e8b42d80-f750-11ed-b6d0-850581e4a72d","name":"panel_0","type":"visualization"}],"type":"dashboard","updated_at":"2023-05-20T20:57:32.274Z","version":"WzUwLDVd"} +{"exportedCount":2,"missingRefCount":0,"missingReferences":[]} \ No newline at end of file diff --git a/tutorial/DemoLandingPage.md b/tutorial/DemoLandingPage.md new file mode 100644 index 0000000000..1669e56236 --- /dev/null +++ b/tutorial/DemoLandingPage.md @@ -0,0 +1,64 @@ + +_![](https://raw.githubusercontent.com/opensearch-project/.github/main/profile/banner.jpg) +# OpenSearch Observability OTEL Demo + +Welcome to the [OpenSearch](https://opensearch.org/docs/latest) OpenTelemetry [Demo](https://opentelemetry.io/docs/demo/) documentation, which covers how to install and run the demo, and some scenarios you can use to view OpenTelemetry in action. + +## Purpose +The purpose of this demo is to demonstrate the different capabilities of OpenSearch Observability to investigate and reflect your system. + +![](img/DemoFlow.png) + +### Services +[OTEL DEMO](https://opentelemetry.io/docs/demo/services/) Describes the list of services that are composing the Astronomy Shop. 
+ +The main services that are open to user interactions: + +- [Dashboards](https://observability.playground.opensearch.org/) + +- [Demo Proxy](https://observability.playground.demo-proxy.opensearch.org/) + +- [Demo loader](https://observability.playground.demo-loader.opensearch.org/) + +- [Demo feature-flag](https://observability.playground.demo-feature-flag.opensearch.org/) + +### Screenshots +_**The shopping App**_ +![](https://opentelemetry.io/docs/demo/screenshots/frontend-1.png) + +_**The feature flag**_ +![](https://opentelemetry.io/docs/demo/screenshots/feature-flag-ui.png) + +_**The load generator**_ +![](https://opentelemetry.io/docs/demo/screenshots/load-generator-ui.png) + +--- +### Ingestion +OpenSearch supports multiple ingestion pipelines: +- [Data-Prepper](https://github.com/opensearch-project/data-prepper/) is an OpenSearch ingestion project that allows ingestion of OTEL standard signals using the OTEL Collector +- [Jaeger](https://opensearch.org/docs/latest/observing-your-data/trace/trace-analytics-jaeger/) is an ingestion framework with a built-in capability for pushing OTEL signals into OpenSearch +- [Fluent-Bit](https://docs.fluentbit.io/manual/pipeline/outputs/opensearch) is an ingestion framework with a built-in capability for pushing OTEL signals into OpenSearch + +### Integrations +The integration service is a set of pre-canned assets that are loaded together to give users a simple and automatic way to discover and review their services topology. + +These (demo-sample) integrations contain the following assets: +- components & index template mapping +- datasources +- data-stream & indices +- queries +- dashboards + +### Tutorials + +Welcome to the OpenSearch Observability tutorials! + +This tutorial is designed to guide users in the Observability domain through the process of using the OpenSearch Observability plugin. By the end of this tutorial, you will be familiar with building dashboards, creating Piped Processing Language (PPL) queries, federating metrics from Prometheus data sources, and conducting root cause analysis investigations on your data. + +### Overview + +This tutorial uses the OpenTelemetry demo application, an e-commerce application for an astronomy shop. The application includes multiple microservices, each providing different functionalities. These services are monitored and traced using the OpenTelemetry trace collector and additional agents. +The resulting traces and logs are stored in structured OpenSearch indices, following the OpenTelemetry format. +This provides a realistic environment for learning and applying Observability concepts, investigations, and diagnostic patterns. + +[Happy Learning](README.md) diff --git a/tutorial/Memory Leak Tutorial.md b/tutorial/Memory Leak Tutorial.md new file mode 100644 index 0000000000..19e8f6ce93 --- /dev/null +++ b/tutorial/Memory Leak Tutorial.md @@ -0,0 +1,113 @@ +# Memory Leak Investigation Tutorial + +## Tutorial Definition + +The following tutorial describes using metrics and traces to diagnose a memory leak. +Application telemetry, such as the kind that OpenTelemetry can provide, is very useful for diagnosing issues in a +distributed system. We will walk through a scenario demonstrating how to move from high-level metrics +and traces to determine the cause of a memory leak. + +## Diagnosis + +The first step in diagnosing a problem is to determine that a problem exists.
Often the first stop will be a metrics +dashboard provided by a tool such as Metrics Analytics under OpenSearch Observability. + +## Dashboards + +This tutorial contains the OTEL demo dashboards with a number of charts: + +- Recommendation Service (CPU% and Memory) +- Service Latency (from SpanMetrics) +- Error Rate + +Recommendation Service charts are generated from OpenTelemetry Metrics exported to Prometheus, while the Service Latency +and Error Rate charts are generated through the OpenTelemetry Collector Span Metrics processor. + +From our dashboard, we can see that there seems to be anomalous behavior in the recommendation service – spiky CPU +utilization, as well as long tail latency in our p95, 99, and 99.9 histograms. We can also see that there are +intermittent spikes in the memory utilization of this service. +We know that we’re emitting trace data from our application as well, so let’s think about another way that we’d be able +to determine that a problem exists. + +### Traces exploration + +OpenSearch Observability Trace analytics allows us to search for traces and display the end-to-end latency of an entire +request with visibility into each individual part of the overall request. Perhaps we noticed an increase in tail latency +on our frontend requests. The Traces dashboard then allows us to search and filter our traces to include only those that +include requests to the recommendation service. + +By sorting by latency, we’re able to quickly find specific traces that took a long time. Clicking on a trace in the +right panel, we’re able to view the waterfall view. +We can see that the recommendation service is taking a long time to complete its work, and viewing the details allows us +to get a better idea of what’s going on. + +### Confirming the Diagnosis + +We can see in our waterfall view that the `app.cache_hit` attribute is set to false, and that the `app.products.count` value +is extremely high. + +Returning to the search UI, filter to `recommendationservice` in the Service dropdown, and search for `app.cache_hit=true` +in the Tags box. + +Notice that requests tend to be faster when the cache is hit. Now search for `app.cache_hit=false` and compare the +latency. + +You should notice some changes in the visualization at the top of the trace list. + +Now, since this is a contrived scenario, we know where to find the underlying bug in our code. However, in a real-world +scenario, we may need to perform further searching to find out what’s going on in our code, or the interactions between +services that cause it. + +### Context-aware SOP flow + +The following diagram shows the context-aware phases within this SOP. + +The user can be shown a summary of the flow for solving the issue and can also focus on the actual step they are +currently performing. + +The overall process is mapped into a **state machine** in which each step has a state with a **transition**. + +When the user moves into a different **scope** (`time based`, `service based`, `log based`), this is defined as an indexed Context (`Ctx[1]`,`Ctx[2]`,...)
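+
+The first step in the diagram below ("Get All Services") can also be scripted directly against the demo's backends. The following is a minimal sketch, assuming the demo's default local ports, the `admin:admin` development credentials, and the Data Prepper `otel-v1-apm-service-map` index with a `serviceName` field; all of these are assumptions that may differ in your deployment.
+
+```bash
+# Hedged sketch of the "Get All Services" step: list the services known to
+# Prometheus and to OpenSearch. Endpoints, credentials, and index/field names
+# are assumptions for a local demo setup.
+
+# Services scraped by Prometheus (values of the "job" label):
+curl -s "http://localhost:9090/api/v1/label/job/values"
+
+# Services recorded by Data Prepper in the service-map index, via the PPL endpoint:
+curl -sk -u admin:admin -X POST "https://localhost:9200/_plugins/_ppl" \
+  -H 'Content-Type: application/json' \
+  -d '{"query": "source=otel-v1-apm-service-map | stats count() by serviceName"}'
+```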
+ + +```mermaid + stateDiagram-v2 + + State0: Info[Memory Leak Investigation] + [*] --> State0: zoom in time frame + state Ctx[0] { + state fork_state <<fork>> + State0 --> fork_state :Get All Services + fork_state --> State2 + State2: Query?[Prometheus] + + fork_state --> State3 + State3: Query?[OpenSearch] + + state join_state <<join>> + State2 --> join_state + State3 --> join_state + join_state --> State4:combine + State4: Review[Services Latency] + State4 --> State5 : identify Anomaly + state Ctx[1] { + State5: Query?[Service@traces] + State5 --> State6 : time based + State6: Review[Services traces] + State6 --> [*] + State6 --> State7 : whats next ? + State7: Suggest[Correlation with logs] + State7 --> State8 :fetch Logs + state Ctx[2] { + State8: Review[logs] + State8 --> [*] + State8 --> State9: whats next ? + State9: Suggest[logs overlay] + State9 --> State10 :fetch Logs + State10: Review[logs diff] + State10 --> [*] + } + } + } + +``` diff --git a/tutorial/OTEL Demo Architecture.md b/tutorial/OTEL Demo Architecture.md new file mode 100644 index 0000000000..c23283aa7d --- /dev/null +++ b/tutorial/OTEL Demo Architecture.md @@ -0,0 +1,42 @@ +# OTEL Astronomy Demo Application + +The following diagram presents the OTEL Astronomy shop services architecture: + +![](img/DemoServices.png) + + +### Trace Collectors +Gaining a macro-level perspective on incoming data, such as sample counts and cardinality, is essential for comprehending the collector’s internal dynamics. However, when delving into the details, the interconnections can become complex. The Collector Data Flow Dashboard aims to demonstrate the capabilities of the OpenTelemetry demo application, offering a solid foundation for users to build upon. + +Monitoring data flow through the OpenTelemetry Collector is crucial for several reasons. + - All services are traced in all the development languages + - Auto instrumented + - Manual spans and attributes + - Span events + - Span links + +Trace Headers are propagated across all services (**Context propagation**) + + +### Metric Collectors +Collecting all the KPI information into Prometheus time-series storage, including: + - runtime metrics + - HTTP / gRPC latency distribution + +### Data Flow Overview +Collector Data Flow Dashboard provides valuable guidance on which metrics to monitor. Users can tailor their own dashboard variations by adding necessary metrics specific to their use cases, such as the memory_limiter processor or other data flow indicators. This demo dashboard serves as a starting point, enabling users to explore diverse usage scenarios and adapt the tool to their unique monitoring needs. + +The diagram below provides an overview of the system components, showcasing the configuration derived from the OpenTelemetry Collector (otelcol) configuration file utilized by the OpenTelemetry demo application. Additionally, it highlights the observability data (traces and metrics) flow within the system. + +#### Simple purchase use case +This flow diagram shows the trace evolution from the user selecting an item to purchase, through the different backend services, until it reaches the storage database.
+![](img/DemoFlow.png) + +## Reference +**_OTEL Demo info_** +- [architecture](https://opentelemetry.io/docs/demo/architecture/) +- [collector-data-flow-dashboard](https://opentelemetry.io/docs/demo/collector-data-flow-dashboard/) +- [services](https://opentelemetry.io/docs/demo/services/) + +**_OTEL Demo videos_** + - [Cloud Native Live: OpenTelemetry community demo](https://www.youtube.com/watch?v=kD0EAjly9jc) diff --git a/tutorial/Observability Introduction.md b/tutorial/Observability Introduction.md new file mode 100644 index 0000000000..206ead55d9 --- /dev/null +++ b/tutorial/Observability Introduction.md @@ -0,0 +1,160 @@ +# Observability Introduction Tutorial +The purpose of this tutorial is to give users the skills to start building their system's observability representation +using the tools supplied by the Observability plugin. + +--- + +## The Observability acronyms + +The following section describes the main Observability acronyms used in the daily work of Observability domain experts and reliability engineers. +Understanding these concepts and how to use them plays a key role in this tutorial. + +### The SAAFE model +The SAAFE model is a comprehensive approach to Observability that stands for Secure, Adaptable, Automated, Forensic, and Explainable. + +Each element of this model plays a vital role in enhancing the visibility and understanding of systems. + +- **"Secure"** ensures that all data in the system is protected and interactions are guarded against security threats. + +- **"Adaptable"** allows systems to adjust to changing conditions and requirements, making them robust and resilient to evolving business needs and technological advancements. + +- **"Automated"** involves the use of automation to reduce manual tasks, improve accuracy, and enhance system efficiency. This includes automated alerting, remediation, and anomaly detection, among other tasks. + +- **"Forensic"** refers to the ability to retrospectively analyze system states and behaviors, which is crucial for debugging, identifying root causes of issues, and learning from past incidents. + +- **"Explainable"** stresses the importance of clear, understandable insights. It's not just about having data; it's about making that data meaningful, comprehensible, and actionable for engineers and stakeholders. + +The SAAFE model provides a holistic approach to Observability, ensuring systems are reliable, efficient, secure, and user-friendly. + + +### The Insight Strategy + +To correctly quantify the health of the system, an observability domain expert can create a set of metrics that represents the overall KPI health hot-spots +of the system. + +Once any of these KPIs is exceeded, an Insight is generated with the appropriate context in which the user can investigate the cause of this behavior. + +The metrics most likely to be part of these collections come from the most "central" services in the system, which have the highest potential to influence and impact +the user's satisfaction with the system. + +In this context, "centrality" refers to the importance or influence of certain components or services within the overall system. + +Central services are the ones that play a crucial role in system operations, acting as a hub or a nexus for other services. + +They often process a large volume of requests, interact with many other services, or handle critical tasks that directly impact the user experience.
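+
+One rough way to surface such "central" services in the demo data is to rank services by span volume. The sketch below is a hedged example rather than part of the original tutorial; it assumes the Data Prepper `otel-v1-apm-span-*` index naming, a `serviceName` field, and the local `admin:admin` development credentials.
+
+```bash
+# Hedged sketch: rank services by span count as a crude proxy for centrality.
+# Index name, field name, endpoint, and credentials are assumptions for a local demo setup.
+curl -sk -u admin:admin -X POST "https://localhost:9200/_plugins/_ppl" \
+  -H 'Content-Type: application/json' \
+  -d '{"query": "source=otel-v1-apm-span-* | stats count() as spans by serviceName | sort - spans | head 10"}'
+```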
+ +An issue in a central service can have cascading effects throughout the system, affecting many other components and potentially degrading the user experience significantly. + +That's why monitoring the key performance indicators (KPIs) of these central services can be particularly informative about the overall system's health. + +By focusing on these central services, an observability domain expert can quickly identify and address issues that are likely to have a significant impact on the system. + +When any of the KPIs for these services exceed their thresholds, an Insight is generated, providing valuable context for investigating and resolving the issue. + +This approach enhances system reliability and user satisfaction by ensuring that potential problems are identified and addressed proactively. + +### The RED monitoring Strategy + +The RED method in Observability is a key monitoring strategy adopted by organizations to understand the performance of their systems. +The RED acronym stands for **Rate**, **Errors**, and **Duration**. + +- **"Rate"** + +indicates the number of requests per second that your system is serving, helping you measure the load and traffic on your system. + +- **"Errors"** + +tracks the number of failed requests over a period of time, which could be due to various reasons such as server issues, bugs in the code, or problems with infrastructure. + +- **"Duration"** + +measures the amount of time it takes to process a request, which is crucial to understand the system latency and responsiveness. + +By monitoring these three aspects, organizations can gain valuable insights into their systems' health and performance, allowing them to make data-driven decisions, optimize processes, and maintain a high level of service quality. + +### Service Level Objectives (SLOs) and Service Level Agreements (SLAs) + +Service Level Objectives (SLOs) and Service Level Agreements (SLAs) are essential aspects of Observability, playing crucial roles in maintaining and improving the quality of services in any system. + +An SLO represents the target reliability of a particular service over a period, often defined in terms of specific metrics like error rate, latency, or uptime. + +They form the basis for informed decision making, providing a clear understanding of the expected system behavior and guiding the engineering teams in their operations and development efforts. + +On the other hand, an SLA is a formal agreement between a service provider and its users that defines the expected level of service. + +It usually includes the SLOs, as well as the repercussions for not meeting them, such as penalties or compensations. + +This ensures accountability, aids in setting realistic expectations, and allows the service provider to manage and mitigate potential issues proactively. Therefore, both SLOs and SLAs are indispensable tools for maintaining service quality, enhancing user satisfaction, and driving continuous improvement. + +### The burn-rate Strategy + +SLO burn rate is a concept in site reliability engineering that refers to the rate at which a service is consuming or "burning" through its error budget. + +The error budget is essentially the allowable threshold of unreliability, which is derived from the service's Service Level Objective (SLO). If a service is fully reliable and experiencing no issues, it won't be burning its error budget at all, meaning the burn rate is zero. 
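+
+The arithmetic behind the burn rate is simple; here is a tiny, hedged sketch with purely illustrative numbers (not taken from the demo):
+
+```bash
+# Burn rate = observed error ratio / error budget allowed by the SLO.
+slo=0.999                                            # 99.9% success target
+error_budget=$(echo "1 - $slo" | bc -l)              # 0.001 -> 0.1% of requests may fail
+observed_error_ratio=0.005                           # 0.5% of requests currently failing
+burn_rate=$(echo "$observed_error_ratio / $error_budget" | bc -l)   # 5: budget burns 5x too fast
+days_to_exhaustion=$(echo "30 / $burn_rate" | bc -l)                # a 30-day budget is gone in ~6 days
+echo "burn rate: $burn_rate; 30-day budget exhausted in ~$days_to_exhaustion days"
+```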
+ +On the other hand, if a service is experiencing issues and failures, it will be burning through its error budget at a certain rate. + +The burn rate is an important metric because it can provide an early warning sign of trouble. If the burn rate is too high, it means the service is using up its error budget too quickly, and if left unchecked, it could exceed its SLO before the end of the measurement period. + +By monitoring the burn rate, teams can proactively address issues, potentially before they escalate and impact users significantly. + +### Baseline and Anomaly Detection + +Anomaly detection in Observability is a powerful technique used to identify unusual behavior or outliers within system metrics that deviate from normal operation. + +Anomalies could be indicative of potential issues such as bugs in code, infrastructure problems, security breaches, or even performance degradation. + +For instance, an unexpected surge in error rates or latency might signify a system failure, while a sudden drop in traffic could imply an issue with the user interface. + +Anomaly detection algorithms often incorporate machine learning techniques that establish a baseline from a correctly functioning system. + +Once a baseline sampling mechanism is in place, it is used to analyze the system data over time to learn what constitutes "normal" behavior. + +The algorithms then continuously monitor the system's state and alert engineers when they detect patterns that diverge from this established norm. + +These alerts enable teams to proactively address issues, often before they affect end-users, thereby enhancing the system's reliability and performance. Anomaly detection plays an indispensable role in maintaining system health, reducing downtime, and ensuring an optimal user experience. + +### Alert fatigue +Alert fatigue is the exhaustion and desensitization that can occur when system administrators, engineers, or operations teams are overwhelmed by a high volume of alerts, many of which may be unimportant, false, or redundant. + +This constant stream of information can result in critical alerts being overlooked or disregarded, leading to delayed response times and potentially serious system issues going unnoticed. + +Alert fatigue is not just a productivity issue; it can also have significant implications for system reliability and performance. + +To mitigate alert fatigue, observability teams must implement intelligent alerting systems that prioritize alerts based on their severity, relevance, and potential impact on the system. + +This includes tuning alert thresholds, grouping related alerts, and incorporating anomaly detection and machine learning to improve the accuracy and relevance of alerts. + +--- + +## Observability Workflow +This part will show how users can use the OpenSearch Observability plugin to build an Observability monitoring solution and use it to further investigate and diagnose +alerts and incidents in the system. + +### Introduction to the Observability tools +This section will give an overview, with short samples, of how to use the Observability tools and API: + 1) PPL Queries + 2) Saved search templates + 3) Correlation built-in queries + 4) Alerts and monitor KPI + 5) Logs analytics + 6) Trace analytics + 7) Metrics analytics + 8) Service maps / graph + +### Collecting telemetry signals using different providers +This section will show how to set up and configure the different ingestion capabilities users have for submitting Observability signals into OpenSearch.
+ 1) Data-prepper - Traces / Metrics + 2) Jaeger - Traces + 3) Fluent-bit - Logs + +### How do we map the OTEL Demo application topology + +This section will define the application services & infrastructure KPIs / SLAs to monitor: + 1) Services break-down and prioritization according to impact analysis - **using graph centrality calculation** + 2) Defining the monitoring channels and alerts + 3) SLO / SLA definitions including burn-rate + 4) Dashboard creation for selected services (RED strategy) + 5) Main health dashboard definition + 6) Sampling data for 'health' baseline creation + diff --git a/tutorial/README.md b/tutorial/README.md new file mode 100644 index 0000000000..75b2248e6d --- /dev/null +++ b/tutorial/README.md @@ -0,0 +1,59 @@ +# OpenSearch Observability Tutorial + +Welcome to the OpenSearch Observability tutorials! + +This tutorial is designed to guide users in the Observability domain through the process of using the OpenSearch Observability plugin. By the end of this tutorial, you will be familiar with building dashboards, creating Piped Processing Language (PPL) queries, federating metrics from Prometheus data sources, and conducting root cause analysis investigations on your data. + +## Overview + +This tutorial uses the OpenTelemetry demo application, an e-commerce application for an astronomy shop. The application includes multiple microservices, each providing different functionalities. These services are monitored and traced using the OpenTelemetry trace collector and additional agents. + +The resulting traces and logs are stored in structured OpenSearch indices, following the OpenTelemetry format. + +This provides a realistic environment for learning and applying Observability concepts, investigations, and diagnostic patterns. + +## Content + +This tutorial is structured as follows: + +1. **Introduction to the OTEL demo infrastructure & Architecture**: An introduction to the OTEL demo architecture and services, how they are monitored, traced, and collected. + +2. **Introduction to OpenSearch Observability**: A brief introduction to the plugin, its features, and its advantages. + +3. **Building Dashboards**: Step-by-step guide on how to create effective and informative dashboards in OpenSearch Observability. + +4. **Creating PPL Queries**: Learn how to create PPL queries to extract valuable insights from your data. + +5. **Federating Metrics from Prometheus**: Detailed guide on how to federate metrics from a Prometheus data source into OpenSearch Observability. + +6. **Conducting Root Cause Analysis**: Learn how to use the built-in features of OpenSearch Observability to conduct a root cause analysis investigation on your data. + +7. **OpenTelemetry Integration**: Learn how the OpenTelemetry demo application sends data to OpenSearch and how to navigate and understand this data in OpenSearch Observability. + +This tutorial will enhance your understanding of Observability and your ability to use OpenSearch Observability to its fullest. + +**_Enjoy the learning journey!_** + +## Prerequisites + +To get the most out of this tutorial, you should have a basic understanding of Observability, microservice architectures, and the OpenTelemetry ecosystem. + +## Getting Started + +To start the tutorial, navigate to the `Introduction to OpenSearch Observability` section. + +Happy Learning! + +--- + +#### 1. [OTEL Demo Architecture](OTEL Demo Architecture.md) + +#### 2. [Observability Introduction](Observability Introduction.md) + +#### 3.
[Memory Leak Investigation Tutorial](Memory Leak Tutorial.md) + + +--- +## References + +[Cloud Native OpenTelemetry community YouTube lecture](https://www.youtube.com/watch?v=kD0EAjly9jc) \ No newline at end of file diff --git a/tutorial/img/DemoFlow.png b/tutorial/img/DemoFlow.png new file mode 100644 index 0000000000..ddd45b41b6 Binary files /dev/null and b/tutorial/img/DemoFlow.png differ diff --git a/tutorial/img/DemoServices.png b/tutorial/img/DemoServices.png new file mode 100644 index 0000000000..1d0e4cf13a Binary files /dev/null and b/tutorial/img/DemoServices.png differ