Draft: Migrate to OpenTelemetry tracing #13400

MadLittleMods · 2022-07-26T23:44:58Z

Migrate to OpenTelemetry tracing

Now requires the latest Twisted (22.8.0+) in order for tracing to work
- From the backend chapter sync last month, it was discussed that it would probably be fine to require the latest Twisted for tracing as it's mostly a tool for us.

Blockers

Wait for a new Twisted release that includes the contextvars fixes
- See Draft: Migrate to OpenTelemetry tracing #13400 (comment), Draft: Migrate to OpenTelemetry tracing #13400 (comment)
- Update: twisted==22.8.0 (2022-09-06) was released 🚀 - need to try it out!
Use a non-Python Jaeger exporter for production performance (see switch from jaeger-client to opentelemetry #11850 (comment))
- We can probably adapt https://github.com/erikjohnston/rust-jaeger-python-client to be an OpenTelemetry compatible exporter because it's mostly Jaeger API stuff. Help welcome!
force_tracing_for_users
- Discussion: Draft: Migrate to OpenTelemetry tracing #13400 (comment)

Dev notes

poetry run synapse_homeserver --config-path homeserver.yaml

$ poetry cache list
$ poetry cache clear --all pypi

via https://stackoverflow.com/a/70064450/796832

# Stop the current virtualenv if active
$ deactivate

# Find the venv for poetry
$ poetry env info -p
# Remove all the files of the current environment
$ rm -rf $(poetry env info -p)

# Reactivate Poetry shell
$ poetry shell
# Install everything
$ poetry install --extras all

$ psql synapse

ALTER TABLE device_lists_outbound_pokes RENAME COLUMN opentracing_context TO tracing_context;
ALTER TABLE device_lists_changes_in_room RENAME COLUMN opentracing_context TO tracing_context;

ALTER TABLE device_lists_outbound_pokes RENAME COLUMN tracing_context TO opentracing_context;
ALTER TABLE device_lists_changes_in_room RENAME COLUMN tracing_context TO opentracing_context;

Typing

Type hints for optional imports:

OpenTelemetry

https://opentelemetry.io/docs/instrumentation/python/
https://opentelemetry.io/docs/instrumentation/python/manual/
https://opentelemetry-python.readthedocs.io/en/latest/api/
https://opentelemetry-python.readthedocs.io/en/latest/sdk/index.html
Migrating:
- https://opentelemetry.io/docs/migration/opentracing/
- https://opentelemetry-python.readthedocs.io/en/stable/shim/opentracing_shim/opentracing_shim.html
- Shim source: https://github.com/open-telemetry/opentelemetry-python/tree/c9222bfc18ec91f041c2f0a9eac8560f61dcb338/shim/opentelemetry-opentracing-shim/src/opentelemetry/shim/opentracing_shim
- opentracing source: https://github.com/opentracing/opentracing-python/tree/master/opentracing
  - tags.py
- Migrating some references
  - child_of to supplying the context on a OTEL span
  - follows_from references -> links on an OTEL span
  - What's the difference between follows_from and child_of? See https://opentracing.io/specification/
API docs:
- https://opentelemetry-python.readthedocs.io/en/stable/
Typing:
- Add type hints to opentelemetry-sdk and run as a part of tests open-telemetry/opentelemetry-python#1608

Code:

OTEL code:

Sampling:

UDP on macOS (maybe Docker)

Error when using jaeger.thrift over Thrift-compact protocol with UDP on port 6831

homeserver.yaml

tracing:
   # ...
    jaeger_exporter_config:
      agent_host_name: localhost
      agent_port: 6831

2022-08-02 19:37:14,796 - opentelemetry.sdk.trace.export - 360 - ERROR - sentinel - Exception while exporting Span batch.
Traceback (most recent call last):
  File "/Users/eric/Documents/github/element/synapse/.venv/lib/python3.9/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 358, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
  File "/Users/eric/Documents/github/element/synapse/.venv/lib/python3.9/site-packages/opentelemetry/exporter/jaeger/thrift/__init__.py", line 219, in export
    self._agent_client.emit(batch)
  File "/Users/eric/Documents/github/element/synapse/.venv/lib/python3.9/site-packages/opentelemetry/exporter/jaeger/thrift/send.py", line 95, in emit
    udp_socket.sendto(buff, self.address)
OSError: [Errno 40] Message too long

2022-08-02 19:45:50,196 - opentelemetry.exporter.jaeger.thrift.send - 87 - WARNING - sentinel - Data exceeds the max UDP packet size; size 94357, max 65000

Fix: open-telemetry/opentelemetry-python#1061 (comment) -> https://www.jaegertracing.io/docs/1.19/client-libraries/#emsgsize-and-udp-buffer-limits -> jaegertracing/jaeger-client-node#124

$ sysctl net.inet.udp.maxdgram
net.inet.udp.maxdgram: 9216
$ sudo sysctl net.inet.udp.maxdgram=65536
net.inet.udp.maxdgram: 9216 -> 65536
$ sysctl net.inet.udp.maxdgram
net.inet.udp.maxdgram: 65536

To make those changes stick between OS reboots, make sure to add it to /etc/sysctl.conf as well:

/etc/sysctl.conf

net.inet.udp.maxdgram=65536

You also have to set udp_split_oversized_batches to split the batches up if they go over the 65k limit set in OTEL
homeserver.yaml

tracing:
    # ...
    jaeger_exporter_config:
      agent_host_name: localhost
      agent_port: 6831
      # Split UDP packets (UDP_PACKET_MAX_LENGTH is set to 65k in OpenTelemetry)
      udp_split_oversized_batches: true

Using thrift over UDP (port 6831) to communicate to the Jaeger agent doesn't seem to work from within a Complement test Docker container. I wonder why this is the case? I was seeing this same behavior with the Jaeger opentracing stuff. Is the UDP connection being over saturated (1065 spans in one trace)? Can the Jaeger agent in Docker not keep up? We see some spans come over but never the main servlet overarching one that is probably the last to be exported. But using the HTTP Jaeger collector endpoint seems to work fine for getting the whole trace (collector_endpoint: "http://localhost:14268/api/traces?format=jaeger.thrift")

TODO

Pull Request Checklist

Pull request is based on the develop branch
Pull request includes a changelog file. The entry should:
- Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
- Use markdown where necessary, mostly for code blocks.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
~~Pull request includes a sign off~~
Code style is correct
(run the linters)

See #11850

MadLittleMods · 2022-07-26T23:47:59Z

pyproject.toml

@@ -196,7 +196,7 @@ oidc = ["authlib"]
 systemd = ["systemd-python"]
 url_preview = ["lxml"]
 sentry = ["sentry-sdk"]
-opentracing = ["jaeger-client", "opentracing"]
+opentelemetry = ["opentelemetry-api", "opentelemetry-sdk"]


I don't know what I am doing with the dependencies here. Please double check this stuff.

synapse/app/_base.py

synapse/logging/tracing.py

See #13400 (comment)

synapse/logging/tracing.py

``` Invalid type StreamToken for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types ``` Had to add a few more logs to find this instance since the warning doens't give much info where I am setting this invalid attribute. This was good enough to find it in the code. ``` BoundedAttributes __setitem__ key=since_token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481272, typing_key=0, receipt_key=340, account_data_key=1233, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0) BoundedAttributes __setitem__ key=now_token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481287, typing_key=0, receipt_key=340, account_data_key=1233, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0) BoundedAttributes __setitem__ key=token value=StreamToken(room_key=RoomStreamToken(topological=None, stream=1787, instance_map=frozendict.frozendict({})), presence_key=481291, typing_key=0, receipt_key=340, account_data_key=1237, push_rules_key=8, to_device_key=57, device_list_key=199, groups_key=0) ```

Fix error: ``` AttributeError: 'SpanContext' object has no attribute 'get' ``` `Context`: ``` {'current-span-1a226c96-a5db-4412-bcaa-1fdd34213c5c': _Span(name="sendToDevice", context=SpanContext(trace_id=0x5d2dcc3fdc8205046d60a5cd18672ac6, span_id=0x715c736ff5f4d208, trace_flags=0x01, trace_state=[], is_remote=False))} ``` `SpanContext`: ``` SpanContext(trace_id=0xf7cd9d058b7b76f364bdd649c4ba7b8a, span_id=0x287ce71bac31bfc4, trace_flags=0x01, trace_state=[], is_remote=False) ```

The `incoming-federation-request` vs `process-federation_request` was first introduced in #11870 - Span for remote trace: `incoming-federation-request` - `child_of` reference: `origin_span_context` - `follows_from` reference: `servlet_span` - Span for local trace: `process-federation-request` - `child_of` reference: `servlet_span` (by the nature of it being active) - `follows_from` reference: `incoming-federation-request`

synapse/federation/transport/server/_base.py

…ntelemetry works

See: - open-telemetry/opentelemetry-python#198 (comment) - open-telemetry/opentelemetry-python#219

synapse/federation/transport/server/_base.py

See #13400 (comment)

MadLittleMods · 2022-08-09T01:42:28Z

synapse/storage/database.py

@@ -701,15 +701,15 @@ def new_transaction(
                    exception_callbacks=exception_callbacks,
                )
                try:
-                    with opentracing.start_active_span(
+                    with tracing.start_active_span(
                        "db.txn",


When viewing a trace for a servlet, we have the db.{desc} span but the child database spans are missing like db.connection, db.txn, db.query 🤔

Probably need to assert that the parents are set correctly in the tests as well (particularly with deferreds).

Conflicts: synapse/logging/opentracing.py tests/logging/test_opentracing.py

Conflicts: poetry.lock synapse/federation/federation_client.py synapse/federation/federation_server.py synapse/handlers/federation.py synapse/handlers/federation_event.py synapse/logging/opentracing.py synapse/rest/client/room.py synapse/storage/controllers/persist_events.py synapse/storage/controllers/state.py

Conflicts: poetry.lock synapse/api/auth.py synapse/federation/federation_client.py synapse/logging/opentracing.py synapse/rest/client/keys.py synapse/rest/client/sendtodevice.py synapse/storage/schema/__init__.py

pyproject.toml

Conflicts: synapse/storage/schema/__init__.py

``` Traceback (most recent call last): File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 306, in _async_render_wrapper callback_return = await self._async_render(request) File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 512, in _async_render callback_return = await raw_callback_return File "/usr/local/lib/python3.9/site-packages/synapse/federation/transport/server/_base.py", line 357, in new_func remote_parent_span = create_non_recording_span() File "/usr/local/lib/python3.9/site-packages/synapse/logging/tracing.py", line 502, in create_non_recording_span return opentelemetry.trace.NonRecordingSpan( AttributeError: 'NoneType' object has no attribute 'trace' ```

Hopefully fix: ``` File "/home/runner/work/synapse/synapse/synapse/storage/controllers/persist_events.py", line 246, in add_to_queue links=[Link(end_item.tracing_span_context)], builtins.TypeError: __init__() takes 1 positional argument but 2 were given ```

Conflicts: synapse/storage/schema/__init__.py

Conflicts: .github/workflows/tests.yml poetry.lock synapse/storage/schema/__init__.py

Conflicts: synapse/handlers/message.py synapse/logging/opentracing.py

Conflicts: poetry.lock (conflicts not fixed) synapse/handlers/message.py synapse/handlers/relations.py synapse/storage/databases/main/devices.py synapse/storage/schema/__init__.py

Conflicts: docs/usage/configuration/config_documentation.md poetry.lock synapse/handlers/message.py synapse/http/server.py synapse/logging/opentracing.py synapse/rest/client/keys.py synapse/rest/client/room.py

Conflicts: poetry.lock

MadLittleMods · 2023-05-20T00:05:52Z

synapse/logging/tracing.py

+    span = start_span(
+        name=name,
+        context=context,
+        kind=kind,
+        attributes=attributes,
+        links=links,
+        start_time=start_time,
+        record_exception=record_exception,
+        set_status_on_exception=set_status_on_exception,
+        tracer=tracer,
+    )
+
+    # Equivalent to `tracer.start_as_current_span`
+    return opentelemetry.trace.use_span(
+        span,
+        end_on_exit=end_on_exit,
+        record_exception=record_exception,
+        set_status_on_exception=set_status_on_exception,
+    )


Moving this over from #13440 (comment),

Occasionally, I see spans that have a super massive duration and shows up with 213503982d 8h in the Jaeger UI. Obviously looks like some sort of max duration bug.

The raw data for those spans are the following. Only these spans are messed up and the end_time does not affect any more parent spans before _process_pulled_event.

db.get_partial_state_events

{ "traceID": "563478e5d75db03e9fc028822fde0649", "spanID": "5b63b0172763c5cf", "flags": 1, "operationName": "db.get_partial_state_events", "references": [ { "refType": "CHILD_OF", "traceID": "563478e5d75db03e9fc028822fde0649", "spanID": "10c9f2a2c50e286f" } ], "startTime": 1659994809436162, "duration": 18446744073709517944, "tags": [ // ... ], "logs": [], "processID": "p1", "warnings": null }

_process_pulled_event

{ "traceID": "563478e5d75db03e9fc028822fde0649", "spanID": "6e9ee2608b03b542", "flags": 1, "operationName": "_process_pulled_event", "references": [ { "refType": "CHILD_OF", "traceID": "563478e5d75db03e9fc028822fde0649", "spanID": "af2c9c8d08284134" } ], "startTime": 1659994809432356, "duration": 18446744073709532019, "tags": [ // ... ], "logs": [], "processID": "p1", "warnings": null },

Definitely close to the Python "max int size" (unsigned word, 64 bits): sys.maxsize * 2 + 1 (or 2^64 - 1) -> 18446744073709551615 (via https://stackoverflow.com/questions/7604966/maximum-and-minimum-values-for-ints)

It seems like it's some negative value getting turned into an unsigned integer so it wraps to the end of the value range. But I don't see the math where that can happen yet.

start_time and end_time are calculated by time.time_ns() (_time_ns)

And the Jaeger exporter calculates startTime and duration with this small _nsec_to_usec_round function.

To see if it's just some endTime (0) - startTime (some positive number) = some negative number, type of problem: If we take the Python max value minus the duration we see from _process_pulled_event: 18446744073709551615 - 18446744073709532019 -> 19596 microseconds. TODO: Is that reasonable?

MadLittleMods · 2023-06-23T22:09:34Z

Closing as our existing OpenTracing setup seems to just work over UDP inside Docker (for Complement tests) on my Linux machine now so I don't have a personal need for this anymore (didn't work properly on macOS).

This is probably best tackled with a fresh PR with some of the learnings from here anyway

Migrate to OpenTelemetry tracing

0cc610e

See #11850

MadLittleMods commented Jul 26, 2022

View reviewed changes

MadLittleMods added 4 commits July 26, 2022 21:53

Some shim and some new

2fe6911

Progress towards OTEL

6984cef

Server running

6406fd5

Export to Jaeger (things are showing up)

2428172

MadLittleMods added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Jul 27, 2022

MadLittleMods commented Jul 27, 2022

View reviewed changes

synapse/app/_base.py Outdated Show resolved Hide resolved

MadLittleMods commented Jul 27, 2022

View reviewed changes

synapse/logging/tracing.py Outdated Show resolved Hide resolved

MadLittleMods added 4 commits July 27, 2022 12:52

Revert changes to Sentry scopes (not OTEL)

0d7a2b9

See #13400 (comment)

We use the config for the Jaeger exporter now

9e1de86

Fix some lints

f6c3b22

Fixup some todos

3a25996

MadLittleMods commented Jul 28, 2022

View reviewed changes

synapse/logging/tracing.py Outdated Show resolved Hide resolved

MadLittleMods added 7 commits July 28, 2022 19:43

Fix some lints

1b0840e

Record exception

19d20b5

Explain weird function

786dd9b

Move to start_active_span

d29a4af

MadLittleMods commented Jul 30, 2022

View reviewed changes

synapse/federation/transport/server/_base.py Outdated Show resolved Hide resolved

MadLittleMods added 5 commits July 29, 2022 22:18

Working second test although it's a bit pointless testing whether ope…

041acdf

…ntelemetry works

Passing tests and context manager doesn't seem to be needed

d848156

Use correct type for what start_as_current_span returns

070195a

See: - open-telemetry/opentelemetry-python#198 (comment) - open-telemetry/opentelemetry-python#219

Use HTTP_HOST attribute

7772f50

Fix some lints

322da51

MadLittleMods commented Aug 1, 2022

View reviewed changes

synapse/federation/transport/server/_base.py Show resolved Hide resolved

MadLittleMods commented Aug 1, 2022

View reviewed changes

synapse/federation/transport/server/_base.py Show resolved Hide resolved

todos

33fd24e

Try fix Twisted/treq problems

7566375

See #13400 (comment)

MadLittleMods commented Aug 9, 2022

View reviewed changes

MadLittleMods added 8 commits August 9, 2022 14:46

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

7024d7b

Conflicts: synapse/logging/opentracing.py tests/logging/test_opentracing.py

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

50f0342

Conflicts: poetry.lock synapse/api/auth.py synapse/federation/federation_client.py synapse/logging/opentracing.py synapse/rest/client/keys.py synapse/rest/client/sendtodevice.py synapse/storage/schema/__init__.py

Try to resolve poetry deps

f73bc59

Poetry install again

a15592d

poetry update

32b9d16

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

6c40dfa

Install otel deps from develop

ad3e324

MadLittleMods commented Sep 12, 2022

View reviewed changes

pyproject.toml Show resolved Hide resolved

MadLittleMods added 13 commits September 13, 2022 08:54

OTEL install with DMR

15e242e

Update Twisted to lastest

d730a46

Remove linting from CI for now

ed11237

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

19c6f6e

Conflicts: synapse/storage/schema/__init__.py

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

84f91e3

Conflicts: synapse/storage/schema/__init__.py

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

b86869f

Conflicts: .github/workflows/tests.yml poetry.lock synapse/storage/schema/__init__.py

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

e4b9898

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

4a495ac

Conflicts: synapse/handlers/message.py synapse/logging/opentracing.py

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

7d70acd

Conflicts: poetry.lock (conflicts not fixed) synapse/handlers/message.py synapse/handlers/relations.py synapse/storage/databases/main/devices.py synapse/storage/schema/__init__.py

Fix poetry.lock conflicts

627951e

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

d993cb0

Conflicts: docs/usage/configuration/config_documentation.md poetry.lock synapse/handlers/message.py synapse/http/server.py synapse/logging/opentracing.py synapse/rest/client/keys.py synapse/rest/client/room.py

github-actions bot deployed to PR Documentation Preview November 18, 2022 23:07 View deployment

Merge branch 'develop' into madlittlemods/11850-migrate-to-opentelemetry

7acb365

Conflicts: poetry.lock

github-actions bot deployed to PR Documentation Preview November 21, 2022 16:58 View deployment

MadLittleMods commented May 20, 2023

View reviewed changes

MadLittleMods closed this Jun 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Draft: Migrate to OpenTelemetry tracing #13400

Draft: Migrate to OpenTelemetry tracing #13400

MadLittleMods commented Jul 26, 2022 •

edited

Loading

MadLittleMods Jul 26, 2022

MadLittleMods Aug 9, 2022 •

edited

Loading

MadLittleMods May 20, 2023 •

edited

Loading

MadLittleMods commented Jun 23, 2023

Draft: Migrate to OpenTelemetry tracing #13400

Draft: Migrate to OpenTelemetry tracing #13400

Conversation

MadLittleMods commented Jul 26, 2022 • edited Loading

Blockers

Dev notes

Typing

OpenTelemetry

UDP on macOS (maybe Docker)

TODO

Pull Request Checklist

MadLittleMods Jul 26, 2022

Choose a reason for hiding this comment

MadLittleMods Aug 9, 2022 • edited Loading

Choose a reason for hiding this comment

MadLittleMods May 20, 2023 • edited Loading

Choose a reason for hiding this comment

MadLittleMods commented Jun 23, 2023

MadLittleMods commented Jul 26, 2022 •

edited

Loading

MadLittleMods Aug 9, 2022 •

edited

Loading

MadLittleMods May 20, 2023 •

edited

Loading