feat(instrumentation): add OpenTelemetry tracing and metrics with basic configurations #5175
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #5175       +/-   ##
===========================================
+ Coverage   51.99%   75.23%    +23.24%
===========================================
  Files          95      100         +5
  Lines        6145     6433       +288
===========================================
+ Hits         3195     4840      +1645
+ Misses       2950     1593      -1357
```
The Python OpenTelemetry SDK currently doesn't have support for the …
I think that we should merge the new `metrics` argument with the already existing `monitoring` one.
Great PR, I am looking forward to it.
| Name | Description | Type | Default |
| --- | --- | --- | --- |
| `tracing` | If set, the SDK implementation of the OpenTelemetry tracer will be available and will be enabled for automatic tracing of requests and custom span creation. Otherwise a no-op implementation will be provided. | `boolean` | `False` |
| `span_exporter_host` | If tracing is enabled, this hostname will be used to configure the trace exporter agent. | `string` | `None` |
| `span_exporter_port` | If tracing is enabled, this port will be used to configure the trace exporter agent. | `number` | `None` |
| `metrics` | If set, the SDK implementation of the OpenTelemetry metrics will be available for default monitoring and custom measurements. Otherwise a no-op implementation will be provided. | `boolean` | `False` |
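For context, a minimal sketch of how these arguments could be passed, assuming they end up exposed on the `Flow` constructor exactly as named in the table (the host/port values are placeholders):

```python
from jina import Flow

# Hypothetical usage of the arguments documented above; 4317 is the default
# OTLP gRPC port, matching the collector example further down in this thread.
f = Flow(
    tracing=True,
    span_exporter_host='localhost',
    span_exporter_port=4317,
    metrics=True,
)
```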
I don't understand the sentence. Isn't it going to overlap with the `monitoring` argument?
Yes, my intention is to use the same terms as OpenTelemetry. If people read the OpenTelemetry documentation then the terms are aligned.
Will it be renamed to `traces_exporter_host`?
| `span_exporter_host` | If tracing is enabled, this hostname will be used to configure the trace exporter agent. | `string` | `None` |
| `span_exporter_port` | If tracing is enabled, this port will be used to configure the trace exporter agent. | `number` | `None` |
I know this is a small thing that I mentioned already, so sorry to be a PITA about this, but I really think we should switch these around to `host/port_span_exporter` to align them with the nomenclature of the Prometheus feature. It's the small things that make a good user experience imo.
@JohannesMessner what name would you suggest?
The `port_monitoring` won't exist in the near future and there will be only the `span_exporter` attributes. I'm generally used to seeing and using `_host` as the suffix rather than as a prefix.
But then we might introduce a breaking change, right? We need to be careful.
We can deprecate an argument if needed, but this should be thought through ahead of time.
@girishc13 could you show here what the relevant arguments would be in this near future where `port_monitoring` does not exist?
Also, you can think of the naming in terms of the YAML configuration for the OpenTelemetry Collector. The hierarchy that I'm implicitly used to is: dependency -> service -> host, port, .... So this naturally follows the convention of `service.host` and `service.port`.
```yaml
version: "3"
services:
  # Jaeger
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14250"
  otel-collector:
    image: otel/opentelemetry-collector:0.61.0
    command: [ "--config=/etc/otel-collector-config.yml" ]
    volumes:
      - ${PWD}/otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "1888:1888"   # pprof extension
      - "8888:8888"   # Prometheus metrics exposed by the collector
      - "8889:8889"   # Prometheus exporter metrics
      - "13133:13133" # health_check extension
      - "55679:55679" # zpages extension
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP http receiver
    depends_on:
      - jaeger
```
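Not part of the PR itself, but as a sanity check, here is a minimal sketch of pointing a plain Python OpenTelemetry SDK tracer at this collector's OTLP gRPC receiver on port 4317 (the endpoint value assumes a local setup like the compose file above):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to the otel-collector service defined above (OTLP gRPC receiver).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='localhost:4317', insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer('collector-smoke-test')
with tracer.start_as_current_span('smoke-test-span'):
    pass  # the span should show up in Jaeger via the collector
```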
I'm not a big fan of deprecating our current `port_monitoring` so quickly after it was introduced, but if it leads to a nicer and more unified experience moving forward then we'll have to do it.
But apart from the argument naming, am I understanding correctly that, according to this plan, the user won't be able to use Prometheus to collect metrics anymore? Or will the setup on the user side remain the same, and we only change the way we expose these metrics from our internals?
Because on the OTel Collector site I still see some Prometheus logos, but some of them are not connected to the system, so I am a bit lost.
If this is the case, then I don't think we should remove the current way users set up their metrics pipeline. This would be a huge breaking change.
> But apart from the argument naming, am I understanding correctly that, according to this plan, the user won't be able to use Prometheus to collect metrics anymore? Or will the setup on the user side remain the same, and we only change the way we expose these metrics from our internals?
The main concern, from my understanding, is introducing a breaking change for the metrics data which requires a new setup. Do we have data on how many users are using the Prometheus client for monitoring, apart from JCloud users? Also, the lack of interoperability between OpenTelemetry monitoring and Prometheus monitoring makes it a bit hard to just remove the current monitoring setup.
I can think of the following ways to tackle this:
- We can also choose to release only the tracing instrumentation and work on the metrics later if we get feedback from the users. I also believe that OpenTelemetry metrics do not provide features as rich as Prometheus, but it's still the direction to go early, to keep users from investing too much into the Prometheus-only solution.
- We deprecate Prometheus monitoring and continue supporting OpenTelemetry tracing and monitoring for users that want to work with OpenTelemetry. The decision is up to the user, and we might have some more work to maintain both.
I would declare the old metric system as deprecated (TO BE REMOVED in a couple of minor releases) and go with the full OpenTelemetry approach.
The official Prometheus library already supports OpenTelemetry APIs and SDKs. The OpenTelemetry Collector also supports scraping data from the existing Prometheus client. We might need some elaborate configuration for metrics and the OpenTelemetry Collector to support the existing mechanism, but OpenTelemetry is the way to go.
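To illustrate the kind of interop referred to here, a minimal sketch (not code from this PR; it assumes the `opentelemetry-exporter-prometheus` and `prometheus_client` packages are installed) of exposing OpenTelemetry instruments on a regular Prometheus scrape endpoint, so an existing Prometheus setup can keep scraping while the code moves to the OpenTelemetry API:

```python
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

# Expose a Prometheus scrape endpoint and bridge OpenTelemetry metrics onto it.
start_http_server(port=8000)
metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))

meter = metrics.get_meter('example-meter')
request_counter = meter.create_counter('request_count')
request_counter.add(1, {'endpoint': '/search'})  # scrapeable at http://localhost:8000/metrics
```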
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
```python
async def async_run_forever(self):
    """Running method of the server."""
    await self.gateway.run_server()
```

```python
from .gateway import HTTPGateway
```
I probably missed this, but I believe it's still possible; it does not produce circular imports for other gateways.
📝 Docs are deployed on https://feat-instrumentation-5155--jina-docs.netlify.app 🎉
Lgtm
Goals:
- [ ] Provide environment variable configurations to enable tracing when required. Use console exporter for now.
- [ ] Convert `send_health_check_sync` or `is_ready` method to async to prevent the grpc aio interceptor from throwing and capturing an exception.
- [ ] … `kwargs` list or arguments.

Sample Usage
Flow
Executor
Client
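A combined sketch for the Flow/Executor/Client usage above, under a few assumptions: the `tracing`/`span_exporter_*` arguments are accepted by the `Flow` constructor as documented in the argument table, and custom spans are created through the plain OpenTelemetry API (this sketch does not assume the PR exposes a tracer on the Executor itself):

```python
from jina import Client, DocumentArray, Executor, Flow, requests
from opentelemetry import trace


class MyExecutor(Executor):
    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        # Custom span via the OpenTelemetry API; with tracing enabled the SDK
        # tracer records it, otherwise the no-op implementation applies.
        tracer = trace.get_tracer(__name__)
        with tracer.start_as_current_span('my-custom-span'):
            for doc in docs:
                doc.text = 'processed'


# tracing/span_exporter_* values are placeholders pointing at a local collector
f = Flow(
    port=12345,
    tracing=True,
    span_exporter_host='localhost',
    span_exporter_port=4317,
).add(uses=MyExecutor)

with f:
    Client(port=12345).post('/', DocumentArray.empty(2))
```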
Collecting Data
Please check the `docker-compose.yml` and `otel-collector-config.yml` under the folder `tests/integration/instrumentation` for running the OpenTelemetry Collector and Jaeger UI locally.