
@mdh1418 mdh1418 commented Aug 5, 2025

Fixes #111368

This PR removes EventPipe's complex dual-lock synchronization scheme and integrates ThreadSessionState cleanup as part of event reading logic.

Thread Lock Removal

  • Eliminated the EventPipe thread lock, replacing the complex dual-lock scheme (thread lock + buffer_manager lock) with a single buffer_manager lock
  • Unified synchronization: Thread list and session_state slot updates now occur atomically under one lock, eliminating race conditions between buffer operations and session state management
  • Simplified lock-free reads: Removed complex cross-lock coordination requirements

ThreadSessionState Lifecycle Cleanup

  • Fixed memory leaks: Proper cleanup of ThreadSessionState objects prevents unbounded growth in long-running sessions
  • Unified resource management: ThreadSessionState and buffer_list now have matching lifetimes, eliminating buffer_list leaks from the previous cleanup logic
  • Eliminated TSS leak scenarios when threads exit before session disable
  • Atomic cleanup operations: Single-lock model enables proper cleanup without cross-lock race conditions
  • SequencePoints are updated when a ThreadSessionState is being removed

EventPipe buffer_manager code had this complex synchronization scheme where some state was protected by a thread lock and other state was protected by the buffer_manager lock. Removing the lock introduces a little extra lock-free behavior, but overall I think it will be a simplification.

1. Previously the thread list in the buffer manager and the thread->session_state slots were updated under separate locks at different times. It was also possible that a TSS (ThreadSessionState) would never get added to the list. The function ep_session_remove_dangling_session_states was designed as a band-aid to clean that up, but it still didn't handle the case where the thread exits prior to the session being disabled, which appears to permanently leak the TSS. Now the session_state slots and the buffer_manager thread list are always updated together, and ep_buffer_manager_deallocate_buffers will clean up all thread session state at session disable time at the latest.
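
As a rough illustration of the "always updated together" invariant, here is a minimal sketch under one lock. All names (manager_attach, tss_t, and so on) are hypothetical stand-ins, not the actual EventPipe functions, and a simple spinlock stands in for the buffer_manager lock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical sketch (not the real EventPipe API): a single lock guards
 * BOTH the manager's thread session state list and the per-thread
 * session_state slot, so the two can never disagree. */

typedef struct tss {
    struct tss *next;
} tss_t;

typedef struct {
    tss_t *session_state; /* the thread's slot; NULL when detached */
} thread_t;

typedef struct {
    atomic_flag lock;  /* stand-in for the buffer_manager lock */
    tss_t *list_head;  /* thread session state list */
} manager_t;

static void mgr_lock(manager_t *m)   { while (atomic_flag_test_and_set(&m->lock)) {} }
static void mgr_unlock(manager_t *m) { atomic_flag_clear(&m->lock); }

/* The list insert and the slot publish happen in ONE critical section,
 * so a TSS can no longer sit in a slot while missing from the list. */
static tss_t *manager_attach(manager_t *m, thread_t *t)
{
    tss_t *n = calloc(1, sizeof *n);
    mgr_lock(m);
    n->next = m->list_head;
    m->list_head = n;
    t->session_state = n;
    mgr_unlock(m);
    return n;
}
```

Under the old dual-lock scheme, the list insert and the slot write sat under different locks, which is exactly the window the band-aid cleanup tried to paper over.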

2. Previously buffer_lists had different lifetimes than TSS and they were getting leaked when ep_buffer_manager_write_all_buffers_to_file_v4 did cleanup. Now buffer_list is allocated and freed at the same time as TSS so it doesn't require separate explicit cleanup.

3. Previously TSS->write_buffer was set under a different lock than buffer_list->tail_buffer which created a window where tail_buffer was the only one updated. Now they are updated atomically and that window no longer exists. This simplifies try_convert_buffer_to_read_only because there is no longer a possibility of failure.

4. We've been interested in cleaning up the TSS objects more eagerly to prevent unbounded leaks on long-running sessions. Doing this under the existing two-lock scheme would have been aggravating because we need to test if the buffers are empty under the buffer_manager lock, then do the session_state slot clear and TSS deletion under the thread lock. The locks didn't support being taken at the same time and as soon as one thread exited the buffer_manager_lock it was possible to race with another thread wanting to do the same cleanup. Now that there is only one lock the cleanup can happen atomically without needing to add yet another synchronization mechanism.
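
A minimal sketch of what the single-lock model buys for this cleanup; the names (mgr_try_reap, tss_t) are hypothetical, and a simple spinlock stands in for the buffer_manager lock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Illustrative sketch of single-lock cleanup; struct and function names
 * are hypothetical, not the EventPipe sources. */

typedef struct tss {
    int unread_events;  /* stand-in for "are this thread's buffers empty?" */
    struct tss **slot;  /* back-pointer to the owning thread's slot */
    struct tss *next;
} tss_t;

typedef struct {
    atomic_flag lock;   /* stand-in for the single buffer_manager lock */
    tss_t *list_head;
} mgr_t;

static void mgr_lock(mgr_t *m)   { while (atomic_flag_test_and_set(&m->lock)) {} }
static void mgr_unlock(mgr_t *m) { atomic_flag_clear(&m->lock); }

/* Empty-check, slot clear, list unlink, and free all happen inside one
 * critical section, so no second thread can race the same cleanup. */
static int mgr_try_reap(mgr_t *m, tss_t *victim)
{
    int reaped = 0;
    mgr_lock(m);
    if (victim->unread_events == 0) {
        *victim->slot = NULL;  /* clear the thread's session_state slot */
        for (tss_t **pp = &m->list_head; *pp != NULL; pp = &(*pp)->next) {
            if (*pp == victim) { *pp = victim->next; break; }
        }
        free(victim);
        reaped = 1;
    }
    mgr_unlock(m);
    return reaped;
}
```

With the two-lock scheme, the empty check and the slot clear would have had to happen under different locks, reopening a race the moment the first lock was dropped.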

5. I removed buffer_manager_suspend_event rather than adjust its locking scheme. The method didn't do what its name implied, instead it converted writeable buffers to read-only buffers without suspending anything. There is no need to convert those buffers eagerly because the existing event reading code will automatically convert them on-demand.
Copilot AI left a comment

Pull Request Overview

This PR eliminates EventPipe's complex dual-lock synchronization scheme by removing the per-thread lock and implementing proper ThreadSessionState lifecycle cleanup with optimized SequencePoint tracking for better resource management.

Key changes:

  • Removes EventPipe thread locks, replacing dual-lock scheme with single buffer_manager lock
  • Implements proper ThreadSessionState cleanup to prevent memory leaks in long-running sessions
  • Adds EventPipeThreadSequencePointTrackState enum for optimized SequencePoint tracking

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/native/eventpipe/ep.c: Adds session type check for process info logging
  • src/native/eventpipe/ep-types.h: Adds thread ID container type definitions
  • src/native/eventpipe/ep-types-forward.h: Defines EventPipeThreadSequencePointTrackState enum for thread lifecycle tracking
  • src/native/eventpipe/ep-thread.h: Updates thread synchronization model and session state management
  • src/native/eventpipe/ep-thread.c: Implements new thread session state lifecycle and removes thread locks
  • src/native/eventpipe/ep-session.h: Adds session_type getter
  • src/native/eventpipe/ep-session.c: Removes dangling session state cleanup logic
  • src/native/eventpipe/ep-event-instance.c: Removes thread reference counting from sequence points
  • src/native/eventpipe/ep-buffer.h: Updates lock requirements documentation
  • src/native/eventpipe/ep-buffer.c: Updates buffer conversion synchronization
  • src/native/eventpipe/ep-buffer-manager.h: Updates buffer manager structure and removes suspend functionality
  • src/native/eventpipe/ep-buffer-manager.c: Major refactoring of buffer management with unified lock model
  • src/native/eventpipe/ep-block.c: Updates sequence point block initialization for new thread ID handling
Comments suppressed due to low confidence (1)

src/native/eventpipe/ep-buffer-manager.c:577

  • The buffer_manager_requires_lock_held assertion is checking the wrong lock. This function is called from buffer_manager_allocate_buffer_for_thread which holds the buffer manager lock, but the assertion appears to be in a context where the lock may not be held.


lateralusX commented Aug 6, 2025

By moving from two locks to one, did you analyze any potential additional lock contention and its effect on EventPipe performance? The buffer manager lock is a central lock for a session, meaning that a lot of parallel activities will queue up on this lock more frequently due to the increased need to lock the buffer manager. Just looking at buffer_manager_advance_to_non_empty_buffer, we reach for the buffer manager lock multiple times where the old code only took the buffer manager lock when a buffer needed to be deleted after being fully consumed. I know a lot of optimizations have taken place over the years to reduce contention on the buffer manager lock and improve performance in EventPipe. Just wanted to make sure we don't introduce any major performance regression by increasing the use of the buffer manager lock in some critical code paths.

EP_ASSERT (thread_session_state != NULL);

ep_buffer_manager_requires_lock_held (thread_session_state->buffer_manager);
ep_buffer_manager_requires_lock_held (ep_session_get_buffer_manager (thread_session_state->session));
Member

The lock should no longer be needed to call this accessor. Previously buffer_list was allocated and freed independently of ThreadSessionState, which meant the pointer could mutate. Now the buffer list is allocated in ep_thread_session_state_alloc and freed in ep_thread_session_state_free. Since the pointer stays constant for the lifetime of the thread_session_state, no lock is required to read an immutable value.

(Technically it transitions from NULL to non-NULL inside alloc, but accessing it indirectly via the this pointer ensures all threads will see the non-NULL value. We don't support super weak memory models like Alpha that could reorder dependent loads)
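
A minimal sketch of that lifetime argument, with hypothetical names in place of the real accessors:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch: buffer_list is assigned once inside alloc and
 * never reassigned, so any thread that legitimately holds a tss pointer
 * can read the field without taking a lock (the read is data-dependent
 * on the tss pointer itself). */

typedef struct { int unused; } buffer_list_t;

typedef struct {
    buffer_list_t *buffer_list; /* set once in alloc, freed in free */
} tss_t;

static tss_t *tss_alloc(void)
{
    tss_t *s = malloc(sizeof *s);
    /* The only NULL -> non-NULL transition happens here, before the
     * tss pointer is ever published to another thread. */
    s->buffer_list = calloc(1, sizeof *s->buffer_list);
    return s;
}

/* Lock-free accessor: the value is immutable for the tss lifetime. */
static buffer_list_t *tss_get_buffer_list(const tss_t *s)
{
    return s->buffer_list;
}

static void tss_free(tss_t *s)
{
    free(s->buffer_list); /* matching lifetimes: freed with the tss */
    free(s);
}
```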

Member Author

Thanks. I was wondering if we still needed to use a lock to synchronize mutations to the buffer_list internals, like ep_buffer_list_insert_tail and ep_buffer_list_get_and_remove_head; all their callsites are under a lock, but there weren't any "requires_lock_held" asserts in either. I'm also wondering if there's a particular reason those are exposed as public APIs instead of static.

// miscategorize it, but that seems unlikely.
uint32_t last_read_delta = sequence_number - thread_sequence_number;
if (0 < last_read_delta && last_read_delta < 0x80000000)
dn_umap_threadid_uint32_insert_or_assign (ep_sequence_point_get_thread_sequence_numbers (sequence_point), os_thread_id, sequence_number);
Member

Updating the current sequence point won't guarantee that future sequence points are correct. When a SP is initialized the sequence point value is populated using an unsynchronized read so the initial value is only a stale lower bound on the correct value. Imagine this sequence:

Thread A creates sequence point 1 and records (A,0) in the map
Thread A writes 100 events
Thread B creates sequence point 2 and records (A, 95), (B,0) in the map.
Thread B writes 100 events
Reader thread reads 100 events from A's queue and updates SP1 to (A,100)
Reader thread sends SP1
Reader thread reads 100 events from B's queue and updates SP2 to (A,95), (B,100)
Reader thread sends SP2 which has bad data... it claims that A went backwards from 100 to 95.

I think one scheme that would compute it correctly is:

  1. Capture every thread in thread_session_state_list when creating sequence points, including the dead ones. The tracking state wouldn't be needed any longer.
  2. Prior to sending a sequence point, enumerate the thread_session_state_list and update any sequence numbers if they are less than last_read_sequence_number (accounting for integer overflow).
  3. Let move_next_event() functions update TSS->last_read_sequence_number as a side-effect.
  4. Let move_next_event() functions delete TSS instances when they no longer have any remaining events. Before the delete we should update the current sequence point (if any), delete the thread id from any other sequence points queued up, and remove the TSS from the session list.
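
The "accounting for integer overflow" in step 2 can use the same wraparound-safe comparison as the quoted snippet above; a small sketch (seq_is_newer is a hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned subtraction of two uint32_t sequence numbers yields their
 * distance modulo 2^32, so "candidate is newer than current" means the
 * delta is non-zero and less than half the counter range (0x80000000).
 * This stays correct across counter wraparound as long as the two values
 * are never more than 2^31 apart. */
static int seq_is_newer(uint32_t candidate, uint32_t current)
{
    uint32_t delta = candidate - current; /* well-defined modular arithmetic */
    return 0 < delta && delta < 0x80000000u;
}
```

Step 2 would then overwrite a sequence point's stored value only when seq_is_newer(last_read_sequence_number, stored) holds.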

Member Author

Thank you! I completely missed this scenario. I extended buffer_manager_remove_and_delete_thread_session_state to also handle updating SequencePoints, if any, so it can be called from both buffer_manager_move_next_event_any|same_thread.

If the writer thread isn't currently writing an event, we should still
convert the buffer to read-only
- Set thread writing_event_in_progress flag for rundown
ep_buffer_manager_write_event now asserts that the thread's
writing_event_in_progress flag is set.

- Remove buffer free state assertion
When sessions are disabled or abruptly disconnected, not all buffers may
have been converted to read-only. log_process_info_event never being
read by the in-proc listener session and an ep_buffer_alloc failure are
both examples of when a buffer would be freed without having been
converted to read-only. Now that converting buffers to read-only yields
for in-progress event writing to complete, we shouldn't be deleting a
buffer that a writer thread is writing into.

- Allow thread to set a NULL session state
As part of thread_session_state removal logic, a NULL pointer can be
written into a thread's session_state slot.

- Allow reader thread to check a thread session state's write_buffer
As part of converting buffers to readonly, the reader thread yields for
writer threads to finish writing their event. So the assertion that only
writing threads will access a thread session state's write buffer no
longer holds.
@mdh1418 mdh1418 force-pushed the ep_remove_thread_lock_and_optimize_tss_cleanup branch 2 times, most recently from 628668f to aa19e3b Compare January 7, 2026 03:36
session_state was switched to volatile in an earlier commit to remove
the thread_lock
- Remove unnecessary buffer_manager field from ThreadSessionState
The buffer_manager can be obtained through the session field

- Remove unnecessary thread_holder field from BufferList
The owning ThreadSessionState already holds a reference to the thread.
Now that the BufferList's lifetime is tied to the ThreadSessionState,
the BufferList's thread_holder is no longer needed.

- Reuse ep_buffer_get_volatile_state over manual state load
When the buffer_manager iterates over events across threads, it tracks
the current EventPipeEventInstance, EventPipeBuffer, and
EventPipeBufferList. In preparation to remove ThreadSessionStates from
the buffer_manager's thread_session_state_list, tracking the current
ThreadSessionState instead of the current EventPipeBufferList will allow
for direct removal, as they are 1:1.

ep_buffer_manager_write_all_buffers_to_file_v4 adjusts sequence point
tss:sequence_number mappings whenever events have been read from the
tss. Instead of having the tss' underlying buffer_list track the last
read sequence number, move the counter to the tss itself.
In preparation to cleanup ThreadSessionStates, adjust the SequencePoint
map of Thread sequence numbers to map the EventPipeThread directly
instead of the ThreadSessionState.
Former ThreadSessionState cleanup only accounted for select scenarios,
such as session disablement or write_all_buffers_to_file_v4. This commit
aims to integrate ThreadSessionState cleanup as part of the core logic
that advances over events across threads so all callsites may benefit.

As ThreadSessionStates hold the last read sequence number for their
EventPipeThread, it is important to update current and future
SequencePoints that have that thread as an entry before deleting the
ThreadSessionState.
@mdh1418 mdh1418 force-pushed the ep_remove_thread_lock_and_optimize_tss_cleanup branch from aa19e3b to 285e3a3 Compare January 8, 2026 03:31

lateralusX commented Jan 8, 2026

@mdh1418 I know you've been doing some performance regression testing on this PR that indicates this change regresses events/s and increases the number of dropped events, so we should follow up on whether we need to optimize this further or change the implementation to reduce the performance regression.

@noahfalk noahfalk left a comment

This looks pretty nice overall. I did notice a couple of potential use-after-free race conditions that we should certainly shore up, and we might need to make some perf adjustments, but I think this is close.

ep_return_false_if_nok (thread != NULL);

for (uint32_t i = 0; i < EP_MAX_NUMBER_OF_SESSIONS; ++i) {
EventPipeThreadSessionState *thread_session_state = (EventPipeThreadSessionState *)ep_rt_volatile_load_ptr ((volatile void **)(&thread->session_state [i]));
Member

This access to thread->session_state doesn't appear to be following thread safety rules outlined in the header comments:

	// Reading from this slot can either be done by taking the buffer manager lock, or by doing a volatile read of the pointer. The volatile
	// read should only occur when running on the OS thread associated with this EventPipeThread object. When reading under the lock the
	// pointer should not be retained past the scope of the lock. When using the volatile read, the pointer should not be retained
	// outside the period where writing_event_in_process == slot_number.
	volatile EventPipeThreadSessionState *session_state [EP_MAX_NUMBER_OF_SESSIONS];

The underlying issue is figuring out what would prevent some other thread from deleting the memory pointed to by thread_session_state sometime after executing line 123 and prior to the thread_session_state->session dereference that occurs at line 127. You would either need to use one of the two synchronization mechanisms already in place, add a new mechanism, or avoid accessing the field.

Member Author

Thanks for the catch, I switched from using the ThreadSessionState's session field to just loading the EventPipeSession via ep_volatile_load_session and signaling its buffer_manager (if any).

// clear the thread session state slot
ep_thread_set_session_state (ep_thread_session_state_get_thread (thread_session_state), ep_thread_session_state_get_session (thread_session_state), NULL);
// delete the thread session state and transitively the buffer list
ep_thread_session_state_free (thread_session_state);
Member

At the moment deallocate_buffers is called from ep_buffer_manager_free() which is called from ep_session_dec_ref() if the ref count is zero. disable_holding_lock() calls ep_session_dec_ref() and I think given the state of code right now it happens that ref is the last one, but nothing in comments or code I saw suggests that has to remain true in the future. We probably want to update this so that even if disable_holding_lock() isn't releasing the last session reference we still get proper cleanup. The bad case that we wouldn't want to happen is that disable_holding_lock() would complete, the ep_lock would be released, the session index would get reassigned to a new session and then some thread intending to write to the new session winds up writing to the old session instead because we haven't cleared the thread_session_state slot.

I suspect a simple fix would be to add some method like ep_session_close() which is called in disable_holding_lock() prior to the ep_session_dec_ref(). close() would call this function to ensure that we've freed all the buffers and detached all the threads prior to ep_lock being released. For consistency maybe rename this method buffer_manager_close(). ep_buffer_manager_free() would no longer call this method but it could assert that the thread_session_state_list is empty validating that the close() already happened.
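
A minimal sketch of the proposed close()/free() split, with hypothetical structures standing in for the real ones:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch: close() performs the deterministic teardown while
 * the session lock is still held, so the slots and list are cleared
 * before the lock is released; free() only releases memory and asserts
 * that close() already ran. Names are illustrative, not the real API. */

typedef struct node { struct node *next; } node_t;

typedef struct {
    node_t *thread_session_state_list;
} buffer_manager_t;

/* Deterministic teardown: detach every thread session state now, not
 * whenever the last ref count happens to drop. */
static void buffer_manager_close(buffer_manager_t *bm)
{
    while (bm->thread_session_state_list != NULL) {
        node_t *n = bm->thread_session_state_list;
        bm->thread_session_state_list = n->next;
        free(n);
    }
}

static void buffer_manager_free(buffer_manager_t *bm)
{
    /* close() must have run; otherwise a slot could still point here. */
    assert(bm->thread_session_state_list == NULL);
    free(bm);
}
```

The assert in free() validates the invariant that no session index can be reassigned while stale thread_session_state slots still point at the old session.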

Member Author

Added ep_session_close() and buffer_manager_deallocate_buffers -> buffer_manager_close, and added the assert that the list is empty when freeing the buffer_manager.

andrewlock added a commit to DataDog/dd-trace-dotnet that referenced this pull request Jan 9, 2026
…8027)

## Summary of changes

Creates a .NET 6+ only implementation of `IRuntimeMetricsListener` that
uses the `System.Diagnostics.Metrics` (and other) APIs

## Reason for change

.NET Core (probably all versions, but at least .NET 6+) has a memory
leak with the event pipes, which means if we enable runtime metrics, we
likely have a slow memory leak 😬 [This was raised ~1 year ago with the
.NET team](dotnet/runtime#111368), specifically
citing dd-trace-dotnet, but doesn't have a fix yet. Also, a PR has been
open on the .NET repo with a tentative fix for ~2 months, so _at best_
this _might_ be fixed in .NET 11.

Separately, the `System.Diagnostics.Metrics` APIs were introduced in
.NET 6, with support for aspnetcore-based metrics added in .NET 8, and
support for "runtime" metrics in .NET 9.

This PR introduces a new (experimental for now)
`IRuntimeMetricsListener` implementation that doesn't use
`EventListener`, and instead uses the `System.Diagnostics.Metrics` APIs,
aiming to provide essentially the same runtime metrics we currently do,
just using a different source.

## Implementation details

- Created a new `IRuntimeMetricsListener` implementation,
`DiagnosticsMetricsRuntimeMetricsListener`
- Added a config to enable it in .NET 6+ only,
`DD_RUNTIME_METRICS_DIAGNOSTICS_METRICS_API_ENABLED`
- Open to suggestions here. Other options include having an "enum" type
for listener instead of just this one. That's harder to consume for
customers, but more extensible theoretically.
- Added tests

To give as wide compatibility as possible, and to avoid any additional
overhead, whenever the built-in runtime metrics use existing APIs (e.g.
via `GC` calls), we use those instead of the metrics.

In summary:

Thread metrics:
- `runtime.dotnet.threads.workers_count`: via `ThreadPool.ThreadCount`
(same as `RuntimeEventListener`)
  • `runtime.dotnet.threads.contention_count`: via
`Monitor.LockContentionCount`

GC metrics:
- `runtime.dotnet.gc.size.gen#` from info in `GC.GetGCMemoryInfo()`,
which mirrors [the built-in
approach](https://github.com/dotnet/runtime/blob/v10.0.1/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/RuntimeMetrics.cs#L185).
- `runtime.dotnet.gc.memory_load` was a tricky one as the built-in uses
a new API, but I think the info we get in `GC.GetGCMemoryInfo()` is
broadly good enough
- `runtime.dotnet.gc.count.gen#` uses `GC.CollectionCount()`, same as
[built-in
approach](https://github.com/dotnet/runtime/blob/v10.0.1/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/RuntimeMetrics.cs#L159)
- `runtime.dotnet.gc.pause_time` this was also a tricky one, more on it
below...

`runtime.dotnet.gc.pause_time` is a runtime metric that's available in
.NET 9, so when we're running in .NET 9 we just use that value. There's
actually also a public API introduced in .NET 6,
`GetTotalPauseDuration()`, but [it's _only_ available from
6.0.21](dotnet/runtime#87143), so we can't
directly reference it. Resorted to using a simple `CreateDelegate` call
to invoke it in these cases. We could use duck typing, but didn't seem
worth it. If we're running < 6.0.21, there's no feasible way to get the
value, so we just don't emit it.

ASP.NET Core metrics:
- `runtime.dotnet.aspnetcore.requests.current`
- `runtime.dotnet.aspnetcore.requests.failed`
- `runtime.dotnet.aspnetcore.requests.total`
- `runtime.dotnet.aspnetcore.requests.queue_length`
- `runtime.dotnet.aspnetcore.connections.current`
- `runtime.dotnet.aspnetcore.connections.queue_length`
- `runtime.dotnet.aspnetcore.connections.total`

Note that the `.total` and `.failed` requests are recorded as _gauges_
(which monotonically increase), which doesn't feel right to me (they
should be counters, surely), but that's what `RuntimeEventListener` is
using, so we have to stick to the same thing (metric types are global by
metric, so we can't change it). It means there's a risk of overflow
there, but that's already the case for `RuntimeEventListener` so I guess
we just ignore it 🤷‍♂️

I couldn't find a way to get the following metrics at all without using
`EventListener`:
- `runtime.dotnet.threads.contention_time`

## Test coverage

Added unit and integration tests for the listener behavior.

I also manually ran an aspnetcore app in a loop with both the
`RuntimeEventListener` and the new listener producing metrics (hacked
in, we won't ever do this in "normal" execution), and did a manual
comparison of the metrics. Overall, the values were in broad agreement
(slightly off, due to skew in sampling time) and helped identify some
cases where I'd made incorrect assumptions (e.g. aspnetcore `.total`
metrics are never "reset" to 0).

## Other details

Relates to:
-
#5862 (comment)
- dotnet/runtime#111368
- dotnet/runtime#118415
- https://datadoghq.atlassian.net/browse/LANGPLAT-916

---------

Co-authored-by: Steven Bouwkamp <steven.bouwkamp@datadoghq.com>
ep_raise_error_if_nok (ep_rt_wait_event_is_valid (&instance->rt_wait_event));

instance->thread_session_state_list_snapshot = NULL;
if (ep_session_get_session_type (session) == EP_SESSION_TYPE_LISTENER) {
Member

is there some benefit to special-casing the snapshotting to only occur for listener sessions? Conceptually it seems simpler if we treat all buffer manager sessions the same way, or we should add some comments about what benefits we get that justify treating them differently.

if (thread_session_state == NULL)
continue;

EventPipeSession *session = ep_volatile_load_session (i);
Member

Unfortunately this is basically the same free-vs-use race as before, now with the Session object instead of the ThreadSessionState object. The session slots have very similar thread-safety requirements to the thread_session_state slots; it's just that the code comments to warn you of that are missing :(

Imagine that in parallel with this function some other thread is executing disable_holding_lock(). This thread needs to prevent disable_holding_lock() from deleting the session object prior to the call to ep_session_get_buffer_manager() below. The easiest option is probably to be holding the ep lock the entire time so that disable_holding_lock() can't run concurrently.


Successfully merging this pull request may close these issues.

CoreCLR EventPipe CPU+memory leak
