feat(instrumentation): add OpenTelemetry tracing and metrics with basic configurations #5175
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #5175       +/-   ##
===========================================
+ Coverage   51.99%   75.23%    +23.24%
===========================================
  Files          95      100         +5
  Lines        6145     6433       +288
===========================================
+ Hits         3195     4840      +1645
+ Misses       2950     1593      -1357
```
The Python OpenTelemetry SDK currently doesn't have support for the …
I think that we should merge the new `metrics` argument with the already existing `monitoring` one.
Great PR, I am looking forward to it.
| Name | Description | Type | Default |
| --- | --- | --- | --- |
| `tracing` | If set, the SDK implementation of the OpenTelemetry tracer will be available and will be enabled for automatic tracing of requests and custom span creation. Otherwise a no-op implementation will be provided. | `boolean` | `False` |
| `span_exporter_host` | If tracing is enabled, this hostname will be used to configure the trace exporter agent. | `string` | `None` |
| `span_exporter_port` | If tracing is enabled, this port will be used to configure the trace exporter agent. | `number` | `None` |
| `metrics` | If set, the SDK implementation of the OpenTelemetry metrics will be available for default monitoring and custom measurements. Otherwise a no-op implementation will be provided. | `boolean` | `False` |
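For context, a minimal sketch of how these arguments could be passed, assuming they end up exposed on the `Flow` constructor exactly as named in the table (the host/port values are placeholders):

```python
from jina import Flow

# Hypothetical usage of the arguments documented above; 4317 is the default
# OTLP gRPC port, matching the collector example further down in this thread.
f = Flow(
    tracing=True,
    span_exporter_host='localhost',
    span_exporter_port=4317,
    metrics=True,
)
```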
I don't understand the sentence. Isn't it going to overlap with the `monitoring` argument?
Yes, my intention is to use the same terms as OpenTelemetry. If people read the OpenTelemetry documentation then the terms are aligned.
Will it be renamed to `traces_exporter_host`?
| `span_exporter_host` | If tracing is enabled, this hostname will be used to configure the trace exporter agent. | `string` | `None` |
| `span_exporter_port` | If tracing is enabled, this port will be used to configure the trace exporter agent. | `number` | `None` |
I know this is a small thing that I mentioned already, so sorry to be a PITA about this, but I really think we should switch these around to `host/port_span_exporter` to align them with the nomenclature of the Prometheus feature. It's the small things that make a good user experience imo.
@JohannesMessner what name would you suggest?
The `port_monitoring` won't exist in the near future and there will be only the `span_exporter` attributes. I'm generally used to seeing and using `_host` as the suffix rather than as a prefix.
But then we might introduce a breaking change, right? We need to be careful.
We can deprecate an argument if needed, but this should be thought through ahead of time.
@girishc13 could you show here what the relevant arguments would be in this near future where `port_monitoring` does not exist?
Also, you can think of the naming in terms of the YAML configuration for the OpenTelemetry Collector. The hierarchy that I'm implicitly used to is: dependency -> service -> host, port, .... So this naturally follows the convention of `service.host` and `service.port`.
```yaml
version: "3"
services:
  # Jaeger
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14250"
  otel-collector:
    image: otel/opentelemetry-collector:0.61.0
    command: [ "--config=/etc/otel-collector-config.yml" ]
    volumes:
      - ${PWD}/otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "1888:1888"   # pprof extension
      - "8888:8888"   # Prometheus metrics exposed by the collector
      - "8889:8889"   # Prometheus exporter metrics
      - "13133:13133" # health_check extension
      - "55679:55679" # zpages extension
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP http receiver
    depends_on:
      - jaeger
```
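Not part of the PR itself, but as a sanity check, here is a minimal sketch of pointing a plain Python OpenTelemetry SDK tracer at this collector's OTLP gRPC receiver on port 4317 (the endpoint value assumes a local setup like the compose file above):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to the otel-collector service defined above (OTLP gRPC receiver).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='localhost:4317', insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer('collector-smoke-test')
with tracer.start_as_current_span('smoke-test-span'):
    pass  # the span should show up in Jaeger via the collector
```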
I'm not a big fan of deprecating our current `port_monitoring` so quickly after it was introduced, but if it leads to a nicer and more unified experience moving forward then we'll have to do it.
But apart from the argument naming, am I understanding correctly that, according to this plan, the user won't be able to use Prometheus to collect metrics anymore? Or will the setup on the user side remain the same, and we only change the way we expose these metrics from our internals?
Because on the OTel Collector site I still see some Prometheus logos, but some of them are not connected to the system, so I am a bit lost.
If this is the case, then I don't think we should remove the current way users set up their metrics pipeline. This would be a huge breaking change.
> But apart from the argument naming, am I understanding correctly that, according to this plan, the user won't be able to use Prometheus to collect metrics anymore? Or will the setup on the user side remain the same, and we only change the way we expose these metrics from our internals?
The main concern, from my understanding, is introducing a breaking change for the metrics data which requires a new setup. Do we have data on how many users are using the Prometheus client for monitoring, apart from JCloud users? Also, the lack of interoperability between OpenTelemetry monitoring and Prometheus monitoring makes it a bit hard to just remove the current monitoring setup.
I can think of the following ways to tackle this:
- We can also choose to release only the tracing instrumentation and work on the metrics later if we get feedback from the users. I also believe that OpenTelemetry metrics do not provide features as rich as Prometheus, but it's still the direction to go early, to keep users from investing too much into the Prometheus-only solution.
- We deprecate Prometheus monitoring and continue supporting OpenTelemetry tracing and monitoring for users that want to work with OpenTelemetry. The decision is up to the user, and we might have some more work to maintain both.
I would declare the old metric system as deprecated (TO BE REMOVED in a couple of minor releases) and go with the full OpenTelemetry approach.
The official Prometheus library already supports OpenTelemetry APIs and SDKs. The OpenTelemetry Collector also supports scraping data from the existing Prometheus client. We might need some elaborate configuration for metrics and the OpenTelemetry Collector to support the existing mechanism, but OpenTelemetry is the way to go.
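To illustrate the kind of interop referred to here, a minimal sketch (not code from this PR; it assumes the `opentelemetry-exporter-prometheus` and `prometheus_client` packages are installed) of exposing OpenTelemetry instruments on a regular Prometheus scrape endpoint, so an existing Prometheus setup can keep scraping while the code moves to the OpenTelemetry API:

```python
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

# Expose a Prometheus scrape endpoint and bridge OpenTelemetry metrics onto it.
start_http_server(port=8000)
metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))

meter = metrics.get_meter('example-meter')
request_counter = meter.create_counter('request_count')
request_counter.add(1, {'endpoint': '/search'})  # scrapeable at http://localhost:8000/metrics
```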
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
```python
async def async_run_forever(self):
    """Running method of the server."""
    await self.gateway.run_server()
```

```python
from .gateway import HTTPGateway
```
I probably missed this, but I believe it's still possible; it does not produce circular imports for other gateways.
📝 Docs are deployed on https://feat-instrumentation-5155--jina-docs.netlify.app 🎉
Lgtm
Goals:
- [ ] Provide environment variable configurations to enable tracing when required. Use console exporter for now.
- [ ] Convert `send_health_check_sync` or `is_ready` method to async to prevent the grpc aio interceptor from throwing and capturing an exception.
- [ ] … `kwargs` list or arguments.

Sample Usage
Flow
Executor
Client
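A combined sketch for the Flow/Executor/Client usage above, under a few assumptions: the `tracing`/`span_exporter_*` arguments are accepted by the `Flow` constructor as documented in the argument table, and custom spans are created through the plain OpenTelemetry API (this sketch does not assume the PR exposes a tracer on the Executor itself):

```python
from jina import Client, DocumentArray, Executor, Flow, requests
from opentelemetry import trace


class MyExecutor(Executor):
    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        # Custom span via the OpenTelemetry API; with tracing enabled the SDK
        # tracer records it, otherwise the no-op implementation applies.
        tracer = trace.get_tracer(__name__)
        with tracer.start_as_current_span('my-custom-span'):
            for doc in docs:
                doc.text = 'processed'


# tracing/span_exporter_* values are placeholders pointing at a local collector
f = Flow(
    port=12345,
    tracing=True,
    span_exporter_host='localhost',
    span_exporter_port=4317,
).add(uses=MyExecutor)

with f:
    Client(port=12345).post('/', DocumentArray.empty(2))
```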
Collecting Data
Please check the `docker-compose.yml` and `otel-collector-config.yml` under the folder `tests/integration/instrumentation` for running the OpenTelemetry Collector and Jaeger UI locally.