Skip to content

Conversation

@mdh1418
Copy link
Member

@mdh1418 mdh1418 commented Jul 24, 2025

Fixes #111368

This PR implements a comprehensive cleanup and optimization of EventPipe's ThreadSessionState management to address performance issues in scenarios with many short-lived threads.

Background

Throughout an EventPipe Session's lifetime, threads that write events induce allocations of EventPipeThread, EventPipeThreadSessionState, EventPipeBufferList, EventPipeBuffer, and EventPipeSequencePoint objects. Many of these objects were held throughout the entire session lifetime instead of just their scope of utility, leading to:

  • Performance degradation due to bloated thread_session_state_list containing exhausted instances
  • Memory inefficiency from retaining objects for threads that have exited
  • Suboptimal event discovery as the buffer manager holds instances whose buffers are exhausted and will never receive new events

This PR does the following:

  1. ThreadSessionState Cleanup: Added cleanup measures to remove ThreadSessionState instances whose buffer_list is empty and the corresponding thread has been unregistered, indicating the thread has exited and will not write new events.

  2. Sequence Point Map Optimization: Moved sequence point map updates to occur in tandem with event processing rather than relying on the potentially trimmed thread_session_state_list. The update now happens after processing all available events from a thread.

  3. Direct Thread ID Storage: Simplified sequence point handling by directly storing native thread IDs in the sequence point map instead of dynamically retrieving them through EventPipeThreadSessionState, eliminating associated ref counting complexity.

  4. Buffer Manager State Tracking: Enhanced the buffer manager to track the current ThreadSessionState being processed, enabling proper cleanup while maintaining the ability to re-add instances that receive new events.

  5. Code Quality Improvements: Updated loops to use foreach macros, consolidated sequence point logic, and improved code organization.

Testing

Running the repro provided in the original issue,

Before

Q:\source\mdh1418\TestApp> corerun .\bin\Debug\net10.0\win-x64\publish\TestApp.dll 10000 10000
Press any key to start spawning threads and throwing exceptions
Enabling CLR EventPipe

Throwing and catching 1000 exceptions (warm up JIT)

Throwing and catching 10000 exceptions
RSS: 79.5MiB, user CPU: 96.9%
40000 EventPipe events received, 10000 exceptions thrown

Spawning 10000 short-lived threads
RSS: 118.7MiB, user CPU: 195.8%
32485 EventPipe events received, 10000 exceptions thrown

Throwing and catching 10000 exceptions
RSS: 126.5MiB, user CPU: 199.7%
7128 EventPipe events received, 10000 exceptions thrown

Reconfiguring EventPipe with identical settings

Throwing and catching 10000 exceptions
RSS: 116.0MiB, user CPU: 104.1%
40014 EventPipe events received, 10000 exceptions thrown

After

Q:\source\mdh1418\TestApp> corerun .\bin\Debug\net10.0\win-x64\publish\TestApp.dll 10000 10000
Press any key to start spawning threads and throwing exceptions
Enabling CLR EventPipe

Throwing and catching 1000 exceptions (warm up JIT)

Throwing and catching 10000 exceptions
RSS: 79.3MiB, user CPU: 97.1%
40000 EventPipe events received, 10000 exceptions thrown

Spawning 10000 short-lived threads
RSS: 117.1MiB, user CPU: 121.9%
40000 EventPipe events received, 10000 exceptions thrown

Throwing and catching 10000 exceptions
RSS: 118.2MiB, user CPU: 103.6%
40000 EventPipe events received, 10000 exceptions thrown

Reconfiguring EventPipe with identical settings

Throwing and catching 10000 exceptions
RSS: 116.9MiB, user CPU: 104.7%
40000 EventPipe events received, 10000 exceptions thrown

mdh1418 added 4 commits July 22, 2025 16:03
Update thread_session_state_list loop to use foreach macro
Unindent sequence point reloop block
Consolidate sequence point loop timestamp boundary update
The event consuming thread advances to the next event using the
buffer_manager's thread_session_state_list, which signals when an event
has written to a buffer. Adding the capability to trim the list when
tss instances have exhausted their events and buffers also requires
re-adding a tss if it receives new events, which is demarcated by
allocating a new buffer_list. To facilitate freeing and NULLing out
buffer_lists, we need to pivot from logic that expects buffer_lists to
persist.

ep_buffer_manager_write_all_buffers_to_file_v4 adjusts sequence points
tss:sequence_number mappings whenever events have been read from the
tss. Instead of having the tss' underlying buffer_list track the last
read sequence point, move the counter to the tss itself to allow
buffer_lists to be NULL'ed.

In order to associate tss with their last_read_sequence_number, the
buffer_manager needs to keep track of the "current" tss from which
events are being processed.
EventPipeSequencePoints store all alive EventPipeThreads IDs and their
corresponding latest read sequence number in a map which is ultimately
used to write the sequence point block. Previously, the native thread ID
was dynamically retrieved through the EventPipeThreadSessionState,
requiring ref counting the EventPipeThread so that it would not be freed
before the immutable native thread id was written. To simplify, directly
map the native thread ids and remove associated ref counting.
The delayed EventPipeThreadSessionState cleanup added to
ep_buffer_manager_write_all_buffers_to_file_v4 does not address the
in-proc EventListener scenario, which is also susceptible to performance
issues should many short-lived threads emit any events. In addition,
the ThreadSessionState cleanup induces a leakage of buffer_lists, which
previously could only be removed by iterating over the buffer_manager's
thread_session_state_list. In preparation for an alternative cleanup
mechanism, revert the former cleanup logic.
EventPipeEventInstance *next_event;

for (uint32_t i = 0; i < dn_vector_ptr_size (&buffer_array) && i < dn_vector_ptr_size (&buffer_list_array); ++i) {
for (uint32_t i = 0; i < dn_vector_ptr_size (&thread_session_state_array) && i < dn_vector_ptr_size (&buffer_list_array); ++i) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to use the FOREACH macro with only tracking the thread_session_state_array and retrieving the associated buffer_list, but ep_thread_session_state_get_buffer_list is protected by the thread lock.

mdh1418 added 2 commits July 24, 2025 17:01
Sequence Points hold a mapping of native thread ids to the threads last
read event sequence number. Previously, the buffer_manager's
thread_session_state_list was used to update the last read event
sequence number right before the sequence point is written. With plans
to cleanup the thread_session_state_list, the invariant that all
EventPipeThreadSessionStates will still be in the list after processing
events is no longer guaranteed. Instead, move the update logic to be in
tandem with processing events on a thread.

ep_buffer_manager_write_all_buffers_to_file_v4 processes events by first
finding the earliest event in all buffers and subsequently processing
all events from that thread. Once all available events have been read
from that thread, update the sequence point mapping.
Throughout an EventPipe Session's lifetime, any new thread that writes
an Event will induce a sequence of allocations including
EventPipeThread, EventPipeThreadSessionStates, EventPipeBufferList,
EventPipeBuffer, EventPipeSequencePoint, etc. Some of these objects are
held throughout the lifetime of the session instead of the scope of
their utility. Such behavior leads to a performance hit. For example,
the buffer_manager's thread_session_state_list is used to discover new
events, and without the proper cleanup, it will hold instances whose
buffers have been exhausted and may never received a new event.

To address a bloated thread_session_state_list, we add cleanup measures
to remove instances whose buffer_list is both empty and the
corresponding thread has been unregistered, indicating that the thread
has exited and will not write new events.
@mdh1418 mdh1418 force-pushed the ep_cleanup_thread_session_states branch from 463e6f5 to 1fedd96 Compare July 24, 2025 21:01
@mdh1418 mdh1418 marked this pull request as ready for review July 24, 2025 22:04
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes EventPipe's ThreadSessionState management to address performance issues in scenarios with many short-lived threads. The changes focus on cleaning up exhausted thread session states and optimizing sequence point handling.

Key changes:

  • Implements cleanup of ThreadSessionState instances when their buffer_list is empty and the thread has exited
  • Simplifies sequence point handling by storing native thread IDs directly instead of using EventPipeThreadSessionState references
  • Enhances buffer manager to track current thread session state being processed for proper cleanup

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/native/eventpipe/ep-thread.h Updates thread ID type and exposes buffer_manager getter for ThreadSessionState
src/native/eventpipe/ep-thread.c Adds cleanup logic in thread unregistration and fixes session state memory management
src/native/eventpipe/ep-event-instance.c Removes reference counting complexity from sequence point cleanup
src/native/eventpipe/ep-buffer-manager.h Adds current_thread_session_state tracking and cleanup function declaration
src/native/eventpipe/ep-buffer-manager.c Implements comprehensive cleanup logic and optimizes sequence point handling
src/native/eventpipe/ep-block.c Updates sequence point serialization to use thread IDs directly
src/native/containers/dn-umap-t.h Adds thread ID to uint32 map type for sequence point storage

// get the next buffer
current_buffer = buffer_list->head_buffer;

if (!current_buffer && (ep_rt_volatile_load_uint32_t_without_barrier (ep_thread_get_unregistered_ref (ep_thread_session_state_get_thread (thread_session_state))) > 0))
Copy link

Copilot AI Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] This condition check is complex and hard to read. Consider extracting it into a helper function like is_thread_unregistered_and_buffer_list_empty to improve readability and maintainability.

Suggested change
if (!current_buffer && (ep_rt_volatile_load_uint32_t_without_barrier (ep_thread_get_unregistered_ref (ep_thread_session_state_get_thread (thread_session_state))) > 0))
if (is_thread_unregistered_and_buffer_list_empty(current_buffer, thread_session_state))

Copilot uses AI. Check for mistakes.
// If a thread is able to drop more than 0x80000000 events in between sequence points then we will
// miscategorize it, but that seems unlikely.
uint32_t last_read_delta = sequence_number - thread_sequence_number;
if (0 < last_read_delta && last_read_delta < 0x80000000)
Copy link

Copilot AI Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number 0x80000000 should be defined as a named constant to clarify its purpose in sequence number overflow detection.

Suggested change
if (0 < last_read_delta && last_read_delta < 0x80000000)
if (0 < last_read_delta && last_read_delta < SEQUENCE_NUMBER_OVERFLOW_THRESHOLD)

Copilot uses AI. Check for mistakes.
// the buffer allocation logic is estimating how many
// buffers a given thread has used (see: ep_thread_session_state_get_buffer_count_estimate and its uses).
EventPipeBufferList *buffer_list;
#ifdef EP_CHECKED_BUILD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to store this field? You can get the same value by calling ep_session_get_buffer_manager (tss->session)

// This field can be read outside the lock when
// the buffer allocation logic is estimating how many
// buffers a given thread has used (see: ep_thread_session_state_get_buffer_count_estimate and its uses).
EventPipeBufferList *buffer_list;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on the field needs to be updated, its description of the lifetime is out-of-date.

/*
* buffer_manager_remove_thread_if_buffer_list_empty
*
* This function should only be called when the thread is unregistered and the buffer_list is empty.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we call this from thread_unregister the buffer list may not be empty and that should be fine. Part of the function's responsibility is to check.

EP_SPIN_LOCK_ENTER (&buffer_manager->rt_lock, section1)
buffer_manager_remove_thread_if_buffer_list_empty (buffer_manager, thread_session_state);
EP_SPIN_LOCK_EXIT (&buffer_manager->rt_lock, section1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also be calling ep_thread_delete_session_state() here to remove the state cached on the thread.

Allowing it to linger until the end of a session creates potential for an unbounded memory leak if an app rapidly creates threads that each write at least one event to the session and then exit.

Annoyingly the session reader thread and the writer thread might be racing to delete it and they can't synchronize using the thread lock because whichever thread loses the race will AV trying to access the lock through a now deleted TSS pointer. I'm thinking about how to solve it.

#ifdef EP_CHECKED_BUILD
for (uint32_t i = 0; i < EP_MAX_NUMBER_OF_SESSIONS; ++i) {
EP_ASSERT (thread->session_state [i] == NULL);
ep_thread_session_state_free (thread->session_state [i]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think thread->session_state[i] can ever be non-NULL here. The TSS has a ThreadHolder inside of it which maintains a ref count on the thread. As long as any TSS exists the ref count is > 0 and ep_thread_free() won't get called.

TSS objects are currently getting cleaned up during session cleanup and I proposed in another comment that we should also be doing it in remove_thread_if_buffer_list_empty().


// the sequence point captured a lower bound for sequence number on each thread, but iterating
// through the events we may have observed that a higher numbered event was recorded. If so we
// should adjust the sequence numbers upwards to ensure the data in the stream is consistent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets preserve this comment by moving it to the same place you moved the sequence point update code above.

current_buffer = buffer_list->head_buffer;

if (!current_buffer && (ep_rt_volatile_load_uint32_t_without_barrier (ep_thread_get_unregistered_ref (ep_thread_session_state_get_thread (thread_session_state))) > 0))
buffer_manager_remove_thread_if_buffer_list_empty (buffer_manager, thread_session_state);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets delay this call until after exiting the buffer_manager lock below, then you can call the ep_ version of function which I suggested in another comment should also take the thread lock and cleanup the TSS state.

*events_written = events_written_for_thread || *events_written;

ep_rt_thread_id_t thread_id = ep_thread_get_os_thread_id (ep_thread_session_state_get_thread (current_thread_session_state));
dn_umap_it_t found = dn_umap_threadid_uint32_find (ep_sequence_point_get_thread_sequence_numbers (sequence_point), thread_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if buffer_manager_try_peek_sequence_point() above returns NULL?

Previously last_read_sequence_number provided a place to record our reads that was always available. I suspect to make this work you will need to change the lifetime of sequence points so that there is always at least one of them in the list. This means you may need to allocate them and enqueue them before its time to timestamp them.

Also we need to preserve the invariant that sequence numbers for the same thread only go up (not counting integer overflow). Imagine if there were two sequence points in the queue both initialized with thread A=0 in the sequence number map. Now on the read thread we read two events from thread A's buffer and increase the current sequence point to report thread A=2. However when we get to the 2nd sequence point it is still storing thread A=0. If there are no events written on thread A while flushing the 2nd sequence point the code right here isn't going to run. Something else needs to. Perhaps when you write each sequence point you can forward any needed updates to the next one.

Thinking on this some more I think its doable, but the sequence point management and updates might get messy. It might be simpler to leave the sequence points and last_read_sequence_number field how they were before this PR and just accept that TSS lifetimes with sequence points will be a little different than without sequence points but I haven't worked through the details to get a good comparison.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realize that we don't have any EventPipeSequencePoints until the first remaining_sequence_point_alloc_budget has been exhausted. Would there be a problem with initializing the remaining_sequence_point_alloc_budget to 0 in ep_buffer_manager_alloc, so we guarantee there is one SequencePoint when the first event is written/buffer is allocated? I now realize that sequence points are intended to only be allocated after sufficient events have been written to buffers, and that the timestamp of the sequence point dictates which events can be processed.

I'm also realizing, could that be the reason why the SequencePoint's maps held a reference to the thread?
Right now, if we exhaust buffers and the thread is unregistered, if we delete the ThreadSessionStates in ep_buffer_manager_remove_thread_if_buffer_list_empty, that means we can hit this last_read_sequence_number update logic after the EPTSS is deleted.
Thats resolvable by replacing EventPipeThreadSessionState *current_thread_session_state = buffer_manager->current_thread_session_state; with ep_rt_thread_id_t current_thread_id = ep_thread_get_os_thread_id (ep_thread_session_state_get_thread (buffer_manager->current_thread_session_state) because we just need the "value" of the thread id to update the map, instead of the EPTSS itself.

Still, to properly update entries in the map, it sounds like there is an inherent requirement that ThreadSessionStates must live beyond all SequencePoints mapping the corresponding thread_ids in order to be able to update each entry. Unless we introduce another struct to track current_thread_id and last_read_sequence_number it doesn't seem like we can remove the thread ref-counting from the map.

}

DN_UMAP_IT_KEY_T (ptr, void *)
DN_UMAP_IT_KEY_T (threadid, ep_rt_thread_id_t)
Copy link
Member

@lateralusX lateralusX Jul 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to do this in ep code, since ep types are not shared between libraries that could use the container headers.

DN_UMAP_T (ptr, void *, int32, int32_t)
DN_UMAP_T (ptr, void *, uint32, uint32_t)
DN_UMAP_T (uint32, uint32_t, ptr, void *)
DN_UMAP_T (threadid, ep_rt_thread_id_t, uint32, uint32_t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, no ep types in container headers.

ep_thread_addref (ep_thread_holder_get_thread (ep_thread_session_state_get_thread_holder_ref (thread_session_state)));
ep_rt_thread_id_t os_thread_id = ep_thread_get_os_thread_id (ep_thread_session_state_get_thread (thread_session_state));

dn_umap_threadid_uint32_insert (ep_sequence_point_get_thread_sequence_numbers (sequence_point), os_thread_id, sequence_number);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a thread might die when being on this map? If that happens, the old code would at least preserve the internal representation of TSS through the addref and then access it through that data structure as a representation of the thread, but now when we use thread id as key, that could be reused by OS, so if everything related to the thread TSS has been freed while we hold the thread id -> sequence point representation in this map, then there might be duplicates. In theory I believe this could happen in the past as well, but we would end up with different TSS instance but with the same internal thread id, so not sure what consequences that would introduce later, just wanted to highlight this potential scenario when we introduce thread id is identification instead of TSS.

@lateralusX
Copy link
Member

lateralusX commented Jul 28, 2025

Taking a step back to understand why we need a bigger refactor/cleanup. Why couldn't we just detect and cleanup consumed TSS on the consumer threads, like we already do in ep_buffer_manager_write_all_buffers_to_file_v4? TSS and buffers need to be kept alive until consumed by threads, so putting the cleanup logic on the consumer, like we already do in ep_buffer_manager_write_all_buffers_to_file_v4, feels straightforward. Any reason why we couldn't simplify the fix to just do that?

The following should hold:

  • Everything gets removed when a session gets deleted.
  • EP needs to keep TSS and buffers for threads that get unregistered but not yet consumed.
  • When EP have consumed all buffers for an unregistered thread, it should remove all state.
  • There is only one parallel consumer thread per session, but many writers. Consumer thread can cleanup at select locations, like ep_buffer_manager_write_all_buffers_to_file_v4.
  • Original issue was with ep_buffer_manager_get_next_event since it doesn't include any capabilities to do cleanup when things have been consumed.

Could cleanup consumed state in ep_buffer_manager_get_next_event when detecting that there are empty buffers, like you do in ep_buffer_manager_remove_thread_if_buffer_list_empty, but call it in ep_buffer_manager_get_next_event and only when the call have detected empty buffers and maybe seen that the TSS thread is unregistered be an alternative? That way we shouldn't need to change ep_buffer_manager_write_all_buffers_to_file_v4 implementation and could probably fix the memory leak without additional regression risks.

@mdh1418 mdh1418 marked this pull request as draft July 29, 2025 20:32
@mdh1418
Copy link
Member Author

mdh1418 commented Aug 5, 2025

Closed in favor of #118415

@mdh1418 mdh1418 closed this Aug 5, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Sep 5, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CoreCLR EventPipe CPU+memory leak

3 participants