As I started to investigate OpenTelemetry (OTel) deployment and hosting patterns, one of the questions I hoped to answer was: what is the recommended security posture for OpenTelemetry collectors (both Gateway and Agent), especially when deployed into a public network?
Although the OpenTelemetry Security Guidance states that "Receivers/Exporters...SHOULD use encryption and authentication", I couldn't find a lot of functional examples that implemented this guidance.
Agent collectors tolerate insecure communication since they typically run on the same host as the application (as a binary, sidecar or daemonset) and communicate over the local network. To secure a Gateway collector deployed on a public network, Juraci Paixão Kröhling's Securing your OpenTelemetry guide from late 2020 provided the most detailed and nuanced instructions I was able to find.
However, some of the authentication patterns and extensions, e.g., per_rpc_auth and oidc, have changed since the original guide. If you follow the instructions to the letter, you'll likely end up with a not-quite-functional authentication flow:
gateway_1 | Error: cannot load configuration: error reading receivers configuration for otlp: 1 error(s) decoding:
gateway_1 | * 'protocols.grpc.auth' has invalid keys: oidc
agent_1 | Error: cannot load configuration: error reading exporters configuration for otlp/auth: 1 error(s) decoding:
agent_1 | * '' has invalid keys: per_rpc_auth
It's a frustrating feeling as a developer (and also where I started with this demo). At some point I'm sure the security guidance in this updated guide will be superseded. When it is, I hope some enterprising reader considers contributing a v3 guide back to the OpenTelemetry community 😍.
This updated guide to Securing Your Open Telemetry Collector... relies heavily on the prior art from the original article and will frequently reference it directly when the original advice remains unchanged.
By the end of this demo, traces originating from your application should flow (securely) from the Agent collector to the Gateway collector. The Gateway collector will export the traces to a Jaeger backend, which is an open source distributed tracing system that provides both a backend data store and visualization UI.
To restate the data flow:
- An application exports data to the OpenTelemetry Agent collector
- An OpenTelemetry Agent collector receives trace data and exports it to the Gateway collector using a bearer token
- An OpenTelemetry Gateway collector validates the token, receives trace data from an Agent, and exports it to the Jaeger service
- A Jaeger service receives the data and stores it in an in-memory data store
- The Jaeger visualization UI displays trace data to end users
In this example implementation, the Agent and Gateway collectors are deployed into the same local network with a security boundary (token authentication) enforced between them.
More typically, the Agent and Gateway collectors will be deployed separately and will not communicate over a local network. Fortunately for us, authentication behavior between the Agent and Gateway collectors should not change regardless of their network configuration, so you can redeploy your Gateway separately and validate the same behavior.
One other note about this guide before we get started: most of the complexity lies in setting up your authentication service locally and securing local network traffic using self-signed certificates (OpenTelemetry bars using authentication patterns over insecure networks).
Once you complete these setup steps, however, we should be able to securely transmit traces from our Agent to Gateway collectors.
The accompanying GitHub repo for this post provides all the necessary code samples for you to proceed. Clone the repo to get started with adding a security boundary between your Agent and Gateway collectors.
This guide requires several tools to correctly configure your working environment and assumes some fluency with each. These tools include:
- Docker to manage containers used throughout this demo
- Cloudflare cfssl to generate self-signed certs for your OpenTelemetry collectors (you can't use authentication extensions over insecure networks)
  # Go 1.17+
  go install github.com/cloudflare/cfssl/cmd/...@latest
  # Go prior to 1.17
  go get github.com/cloudflare/cfssl/cmd/...
- TraceGen to mimic application traffic and generate sample traces
  # Go 1.17+
  go install github.com/open-telemetry/opentelemetry-collector-contrib/tracegen@v0.34.0
  # Go prior to 1.17
  go get github.com/open-telemetry/opentelemetry-collector-contrib/tracegen
If you try to use the insecure OTLP exporter property for the Agent collector in this demo, you will likely see OpenTelemetry Agent logs that state:
Error: cannot setup pipelines: cannot start builtExporters: grpc: the credentials require transport level security (use grpc.WithTransportCredentials() to set)
We'll use self-signed certs to secure our service-to-service communication over our local network. Importantly, the host name for these self-signed certs is gateway (the service name for our OpenTelemetry Gateway collector in our Docker compose file). Using a different name, e.g., localhost, will generate the following error once your collectors try to establish a secure connection: "authentication handshake failed: x509: certificate is valid for localhost, not gateway"
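To make the host name requirement concrete, here is a minimal Python sketch of the kind of name check a TLS client performs against the DNS names listed in a certificate. The function name hostname_matches is hypothetical, and the logic is a simplification (exact match plus a single left-most wildcard label), not the Go TLS code the collectors actually run.

```python
from typing import Iterable


def hostname_matches(cert_names: Iterable[str], hostname: str) -> bool:
    """Return True if hostname is covered by any DNS name in the certificate.

    Simplified rules: case-insensitive exact match, or a "*." wildcard that
    stands in for exactly one left-most label.
    """
    host = hostname.lower().rstrip(".")
    for name in cert_names:
        pattern = name.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            # Drop the first label of the host and compare the remainder.
            _, _, host_rest = host.partition(".")
            if host_rest and host_rest == pattern[2:]:
                return True
    return False


if __name__ == "__main__":
    # A cert issued for "localhost" does not cover the "gateway" service name.
    print(hostname_matches(["localhost"], "gateway"))
    print(hostname_matches(["gateway"], "gateway"))
```

The Agent dials the Gateway by its compose service name, so that name has to appear in the certificate for the handshake to succeed.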
To generate the appropriate certificate files, execute the following commands from a terminal:
pushd certs
./generate-certs.sh
popd
Now that we've generated the certificate artifacts for secure network communication, let's launch the authentication service, the sine qua non of this guide. Like the original guide, we'll use the open source identity and access management tool Keycloak for this demo. Note: you're welcome to bring your own authentication service if you have one lolling around; if you do, you may need to make minor adjustments to subsequent steps.
We'll launch our local authentication service into a user-defined bridge network. This allows cross-container communication between our authentication service and some of the other services we'll create in this demo.
network_name="pullman"
docker network create $network_name
Now that we've created a network, start your local authentication service on this network.
auth_service_name="keycloak"
docker run -p 8080:8080 -e KEYCLOAK_USER=admin -e KEYCLOAK_PASSWORD=admin --network $network_name --name $auth_service_name jboss/keycloak
The instructions from the original guide on configuring your authentication server are still valid, so please complete them before heading back here. One detail from the original guide I highly endorse noting: "by default, Keycloak tokens are valid for only 5 minutes." This is fairly short, especially if you need to diagnose misconfiguration issues. You can extend the token lifespan to up to 10 hours under the Advanced Settings for the agent client.
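If you suspect an expired token while debugging, a small sketch like the one below can tell you how much time a token has left. It decodes the JWT payload without verifying the signature, so treat it as a diagnostic aid only. The function names are my own, and it assumes the token is a three-segment JWT carrying a standard exp claim.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a JWT as a dict. The signature is NOT verified."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # base64url segments omit padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Seconds until the token's exp claim; negative if already expired."""
    claims = decode_jwt_payload(token)
    current = time.time() if now is None else now
    return claims["exp"] - current
```

A negative result from seconds_until_expiry means the Gateway will reject the token, which is a common cause of authentication failures that appear a few minutes after everything worked.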
Once completed, you should have a realm and two clients named collector and agent, respectively. Make sure to note the secret associated with the latter, as you'll need it to generate the bearer token provided by your Agent collector.
The Docker containers running our OpenTelemetry collectors retrieve required authentication values from environment variables (provided via a file named service.env with key-value pairs that mirror the structure below):
service.env
GATEWAY_BEARER_TOKEN="value"
ISSUER_URL="value"
If you're using your own authentication service, you can manually create a file with the required key-value pairs and proceed to the next section (Run Services). If you use the local authentication service we created earlier, these next steps should simplify generating a well-formed environment variable file with valid key-value pairs. Note that we run this Docker image in the same user-defined bridge network. This ensures that the token we generate is for the same domain that our service runs on.
# Value from previous step
client_secret=?
token_endpoint="http://${auth_service_name}:8080/auth/realms/opentelemetry/protocol/openid-connect/token"
issuer_url="http://${auth_service_name}:8080/auth/realms/opentelemetry"
docker build . -t otel-auth:latest
docker run \
  -v $PWD/:/tmp/otel \
  -e token_endpoint=$token_endpoint \
  -e issuer_url=$issuer_url \
  -e client_id=agent \
  -e client_secret=$client_secret \
  -w /tmp/otel \
  --net $network_name \
  otel-auth:latest
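If you bring your own authentication service, the gist of this step is an OAuth2 client credentials request followed by writing two values into service.env. The Python sketch below is one way to do that by hand. It assumes your token endpoint accepts the client_credentials grant with the client secret in the form body and returns JSON containing access_token. The function names are hypothetical, and this is not necessarily what the otel-auth image does internally.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(token_endpoint: str, client_id: str,
                        client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        token_endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def fetch_access_token(request: urllib.request.Request) -> str:
    """Send the request and return the access_token field of the response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["access_token"]


def render_service_env(bearer_token: str, issuer_url: str) -> str:
    """Render the two key-value pairs in the service.env layout shown earlier."""
    return (
        f'GATEWAY_BEARER_TOKEN="{bearer_token}"\n'
        f'ISSUER_URL="{issuer_url}"\n'
    )
```

To use it, pass the result of build_token_request to fetch_access_token, then write the output of render_service_env to service.env. Run it from somewhere that can resolve the authentication service by the same host name the collectors will use, so the issuer in the token matches ISSUER_URL.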
At this point, we've completed all the necessary configuration to secure our local network, run our authentication service, and generate a bearer token to provide on requests from our Agent collector.
We'll use a Docker compose file to orchestrate bringing up the remaining services for this authentication demo:
- Jaeger (backend for trace data and visualization layer)
- OpenTelemetry Gateway collector (validates the bearer token)
- OpenTelemetry Agent collector (provides the bearer token)
# note: the compose file will correctly orchestrate the service dependencies while bringing them online
docker-compose up
There are a few log lines that will help us understand whether the services are operating as expected (condensed for clarity):
gateway_1 | Everything is ready. Begin running and processing data.
gateway_1 | State of the connection with the Jaeger Collector backend {"kind": "exporter", "name": "jaeger", "state": "READY"}
agent_1 | Everything is ready. Begin running and processing data.
If you see these lines, you're in the clear to start receiving traces from your application 🙌. If not (and it happens to all of us), you've got a bit of debugging and troubleshooting to do before you're ready to roll.
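If you'd rather not eyeball the logs, the check can be scripted. The sketch below scans captured compose output for the readiness message quoted above and reports which services logged it. The function name is hypothetical, and it assumes the plain "service | message" line layout you get when the output is saved or piped rather than colorized in a terminal.

```python
import re
from typing import Set

READY_MARKER = "Everything is ready. Begin running and processing data."
# Compose prefixes each line with the service name followed by a pipe.
LINE_PATTERN = re.compile(r"^(?P<service>[\w.-]+)\s*\|\s?(?P<message>.*)$")


def ready_services(log_text: str) -> Set[str]:
    """Return the names of services that logged the readiness marker."""
    ready = set()
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match and READY_MARKER in match.group("message"):
            ready.add(match.group("service"))
    return ready
```

For this demo you would expect both gateway_1 and agent_1 in the result. If either is missing, start your troubleshooting with that service's log lines.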
The moment we've all been waiting for: sending traffic from an application to your Agent collector, then Gateway collector, and then storage backend. We'll simulate some application traffic using tracegen.
To get started, we'll generate a single trace:
tracegen -otlp-insecure -traces 1
You should be able to navigate to the Jaeger Visualization UI, and see the trace data for your service.
Once you are more familiar with the flow, you can generate larger volumes of trace data by using the -duration CLI flag:
tracegen -otlp-insecure -duration 5s
For additional fingers-to-keyboard experience with instrumenting an application, follow the Python code sample for an OTLP exporter.
At this point, you likely have a working implementation. If you don't, I'd love to hear where you're stuck, since there's a decent chance you're not the only person to get hung up on potentially unclear or ambiguous instructions. If you drop a note in the comments, I can try to make the guide more explicit and obvious for the next reader.
My experience building software is that it's much easier to understand important system behaviors when they're broken rather than functioning as expected: what assumptions about the behavior of our system do not hold true under certain conditions?
I've provided a few suggestions for experiments you might try to break our sample implementation and observe the system's failure modes 😍.