Observe the FastAPI application with three pillars of observability on Grafana:
- Traces with Tempo and OpenTelemetry Python SDK
- Metrics with Prometheus and Prometheus Python Client
- Logs with Loki
-
Install Loki Docker Driver
docker plugin install grafana/loki-docker-driver:2.9.2 --alias loki --grant-all-permissions
-
Start all services with docker-compose
docker-compose up -d
If got the error message
Error response from daemon: error looking up logging plugin loki: plugin loki found but disabled
, please run the following command to enable the plugin:docker plugin enable loki
-
Send requests with siege and curl to the FastAPI app
bash request-script.sh bash trace.sh
Or you can use Locust to send requests:
# install locust first with `pip install locust` if you don't have it locust -f locustfile.py --headless --users 10 --spawn-rate 1 -H http://localhost:8000
Or you can send requests with k6:
k6 run --vus 1 --duration 300s k6-script.js
-
Check predefined dashboard
FastAPI Observability
on Grafana http://localhost:3000/ login withadmin:admin
Dashboard screenshot:
The dashboard is also available on Grafana Dashboards.
Grafana provides a great solution, which could observe specific actions in service between traces, metrics, and logs through trace ID and exemplar.
Image Source: Grafana
Get Trace ID from an exemplar in metrics, then query in Tempo.
Query: histogram_quantile(.99,sum(rate(fastapi_requests_duration_seconds_bucket{app_name="app-a", path!="/metrics"}[1m])) by(path, le))
Get Trace ID and tags (here is compose.service
) defined in Tempo data source from span, then query with Loki.
Get Trace ID from log (regex defined in Loki data source), then query in Tempo.
For a more complex scenario, we use three FastAPI applications with the same code in this demo. There is a cross-service action in /chain
endpoint, which provides a good example of how to use OpenTelemetry SDK and how Grafana presents trace information.
We use OpenTelemetry Python SDK to send trace info with gRCP to Tempo. Each request span contains other child spans when using OpenTelemetry instrumentation. The reason is that instrumentation will catch each internal asgi interaction (opentelemetry-python-contrib issue #831). If you want to get rid of the internal spans, there is a workaround in the same issue #831 by using a new OpenTelemetry middleware with two overridden methods for span processing.
We use OpenTelemetry Logging Instrumentation to override the logger format with another format with trace id and span id.
# fastapi_app/utils.py
def setting_otlp(app: ASGIApp, app_name: str, endpoint: str, log_correlation: bool = True) -> None:
# Setting OpenTelemetry
# set the service name to show in traces
resource = Resource.create(attributes={
"service.name": app_name, # for Tempo to distinguish source
"compose_service": app_name # as a query criteria for Trace to logs
})
# set the tracer provider
tracer = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer)
tracer.add_span_processor(BatchSpanProcessor(
OTLPSpanExporter(endpoint=endpoint)))
if log_correlation:
LoggingInstrumentor().instrument(set_logging_format=True)
FastAPIInstrumentor.instrument_app(app, tracer_provider=tracer)
The following image shows the span info sent to Tempo and queried on Grafana. Trace span info provided by FastAPIInstrumentor
with trace ID (17785b4c3d530b832fb28ede767c672c), span id(d410eb45cc61f442), service name(app-a), custom attributes(service.name=app-a, compose_service=app-a) and so on.
Log format with trace id and span id, which is overridden by `LoggingInstrumentor``
%(asctime)s %(levelname)s [%(name)s] [%(filename)s:%(lineno)d] [trace_id=%(otelTraceID)s span_id=%(otelSpanID)s resource.service.name=%(otelServiceName)s] - %(message)s
The following image is what the logs look like.
If you want other services to use the same Trace ID, you have to use inject
function to add current span information to the header. Because OpenTelemetry FastAPI instrumentation only takes care of the asgi app's request and response, it does not affect any other modules or actions like sending HTTP requests to other servers or function calls.
# fastapi_app/main.py
from opentelemetry.propagate import inject
@app.get("/chain")
async def chain(response: Response):
headers = {}
inject(headers) # inject trace info to header
async with httpx.AsyncClient() as client:
await client.get(f"http://localhost:8000/", headers=headers,)
async with httpx.AsyncClient() as client:
await client.get(f"http://{TARGET_ONE_HOST}:8000/io_task", headers=headers,)
async with httpx.AsyncClient() as client:
await client.get(f"http://{TARGET_TWO_HOST}:8000/cpu_task", headers=headers,)
return {"path": "/chain"}
Alternatively, we can use the instrumentation library for HTTPX to instrument HTTPX. Following is the example of using OpenTelemetry HTTPX Instrumentation which will automatically inject trace info to the header.
import httpx
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
HTTPXClientInstrumentor().instrument()
@app.get("/chain")
async def chain(response: Response):
async with httpx.AsyncClient() as client:
await client.get(f"http://localhost:8000/")
async with httpx.AsyncClient() as client:
await client.get(f"http://{TARGET_ONE_HOST}:8000/io_task")
async with httpx.AsyncClient() as client:
await client.get(f"http://{TARGET_TWO_HOST}:8000/cpu_task")
return {"path": "/chain"}
Use Prometheus Python Client to generate OpenTelemetry format metric with exemplars and expose on /metrics
for Prometheus.
In order to add an exemplar to metrics, we retrieve the trace id from the current span for the exemplar and add the trace id dict to the Histogram or Counter metrics.
# fastapi_app/utils.py
from opentelemetry import trace
from prometheus_client import Histogram
REQUESTS_PROCESSING_TIME = Histogram(
"fastapi_requests_duration_seconds",
"Histogram of requests processing time by path (in seconds)",
["method", "path", "app_name"],
)
# retrieve trace id for exemplar
span = trace.get_current_span()
trace_id = trace.format_trace_id(
span.get_span_context().trace_id)
REQUESTS_PROCESSING_TIME.labels(method=method, path=path, app_name=self.app_name).observe(
after_time - before_time, exemplar={'TraceID': trace_id}
)
Because exemplars is a new datatype proposed in OpenMetrics, /metrics
have to use CONTENT_TYPE_LATEST
and generate_latest
from prometheus_client.openmetrics.exposition
module instead of prometheus_client
module. Otherwise using the wrong generate_latest the exemplars dict behind Counter and Histogram will never show up, and using the wrong CONTENT_TYPE_LATEST will cause Prometheus scraping to fail.
# fastapi_app/utils.py
from prometheus_client import REGISTRY
from prometheus_client.openmetrics.exposition import CONTENT_TYPE_LATEST, generate_latest
def metrics(request: Request) -> Response:
return Response(generate_latest(REGISTRY), headers={"Content-Type": CONTENT_TYPE_LATEST})
Metrics with exemplars
There are two methods to add trace information to spans and logs using the OpenTelemetry Python SDK:
- Code-based Instrumentation: This involves adding trace information to spans, logs, and metrics using the OpenTelemetry Python SDK. It requires more coding effort but allows for the addition of exemplars to metrics. We employ this approach in this project.
- Zero-code Instrumentation: This method automatically instruments a Python application using instrumentation libraries, but only when the used frameworks and libraries are supported. It simplifies the process by eliminating the need for manual code changes. However, it does not allow for the addition of exemplars to metrics. For more insights into zero-code instrumentation, refer to my other project, OpenTelemetry APM.
Collects metrics from applications.
Define all FastAPI applications metrics scrape jobs in etc/prometheus/prometheus.yml
.
...
scrape_configs:
- job_name: 'app-a'
scrape_interval: 5s
static_configs:
- targets: ['app-a:8000']
- job_name: 'app-b'
scrape_interval: 5s
static_configs:
- targets: ['app-b:8000']
- job_name: 'app-c'
scrape_interval: 5s
static_configs:
- targets: ['app-c:8000']
Add an Exemplars which uses the value of TraceID
label to create a Tempo link.
Grafana data source setting example:
Grafana data sources config example:
name: Prometheus
type: prometheus
typeName: Prometheus
access: proxy
url: http://prometheus:9090
password: ''
user: ''
database: ''
basicAuth: false
isDefault: true
jsonData:
exemplarTraceIdDestinations:
- datasourceUid: tempo
name: TraceID
httpMethod: POST
readOnly: false
editable: true
Receives spans from applications.
Trace to logs setting:
- Data source: target log source
- Tags: key of tags or process level attributes from the trace, which will be log query criteria if the key exists in the trace
- Map tag names: Convert existing key of tags or process level attributes from trace to another key, then used as log query criteria. Use this feature when the values of the trace tag and log label are identical but the keys are different.
Grafana data source setting example:
Grafana data sources config example:
name: Tempo
type: tempo
typeName: Tempo
access: proxy
url: http://tempo
password: ''
user: ''
database: ''
basicAuth: false
isDefault: false
jsonData:
nodeGraph:
enabled: true
tracesToLogs:
datasourceUid: loki
filterBySpanID: false
filterByTraceID: true
mapTagNamesEnabled: false
tags:
- compose_service
readOnly: false
editable: true
Collect logs with Loki Docker Driver from all services.
- Use YAML anchor and alias feature to set logging options for each service.
- Set Loki Docker Driver options
- loki-url: loki service endpoint
- loki-pipeline-stages: processes multiline log from FastAPI application with multiline and regex stages (reference)
x-logging: &default-logging # anchor(&): 'default-logging' for defines a chunk of configuration
driver: loki
options:
loki-url: 'http://localhost:3100/api/prom/push'
loki-pipeline-stages: |
- multiline:
firstline: '^\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}'
max_wait_time: 3s
- regex:
expression: '^(?P<time>\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2},d{3}) (?P<message>(?s:.*))$$'
# Use $$ (double-dollar sign) when your configuration needs a literal dollar sign.
version: "3.4"
services:
foo:
image: foo
logging: *default-logging # alias(*): refer to 'default-logging' chunk
Add a TraceID derived field to extract the trace id and create a Tempo link from the trace id.
Grafana data source setting example:
Grafana data source config example:
name: Loki
type: loki
typeName: Loki
access: proxy
url: http://loki:3100
password: ''
user: ''
database: ''
basicAuth: false
isDefault: false
jsonData:
derivedFields:
- datasourceUid: tempo
matcherRegex: (?:trace_id)=(\w+)
name: TraceID
url: $${__value.raw}
# Use $$ (double-dollar sign) when your configuration needs a literal dollar sign.
readOnly: false
editable: true
- Add Prometheus, Tempo, and Loki to the data source with config file
etc/grafana/datasource.yml
. - Load predefined dashboard with
etc/dashboards.yaml
andetc/dashboards/fastapi-observability.json
.
# grafana in docker-compose.yaml
grafana:
image: grafana/grafana:10.4.2
volumes:
- ./etc/grafana/:/etc/grafana/provisioning/datasources # data sources
- ./etc/dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml # dashboard setting
- ./etc/dashboards:/etc/grafana/dashboards # dashboard json files directory
- FastAPI Traces Demo
- Waber - A Uber-like (Car-Hailing APP) cloud-native application with OpenTelemetry
- Intro to exemplars, which enable Grafana Tempo’s distributed tracing at massive scale
- Trace discovery in Grafana Tempo using Prometheus exemplars, Loki 2.0 queries, and more
- The New Stack (TNS) observability app
- Don’t Repeat Yourself with Anchors, Aliases and Extensions in Docker Compose Files
- How can I escape a $ dollar sign in a docker compose file?
- Tempo Trace to logs tags discussion
- Starlette Prometheus
- Tempo Example