
perf: Enhance Memory Management with Lock-Free Allocator, Preallocation, and Optimized Thread-Local Caching #2825

Open
wants to merge 15 commits into main

Conversation

@beats-dh (Collaborator) commented Aug 17, 2024

Detailed Description for PR:

1. Introduction of Static Preallocation

preallocate Method: Added functionality to preallocate a fixed number of memory blocks (STATIC_PREALLOCATION_SIZE = 500) during the initialization of the LockfreePoolingAllocator. This minimizes runtime dynamic allocations, boosting overall system performance.

Thread-Safe Initialization: Ensured preallocation occurs only once using std::call_once combined with std::once_flag, guaranteeing thread-safe initialization.
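The preallocation pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the class name `Pool`, the block size of 64 bytes, and the `freeBlockCount` helper are assumptions for the example; only `STATIC_PREALLOCATION_SIZE = 500` and the `std::call_once`/`std::once_flag` combination come from the description.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Sketch: preallocate a fixed number of blocks exactly once, thread-safely.
class Pool {
public:
    static constexpr std::size_t STATIC_PREALLOCATION_SIZE = 500;

    // Every constructor call reaches std::call_once, but the preallocation
    // body is guaranteed to run exactly once per process, even if many
    // threads construct Pool objects concurrently.
    Pool() {
        std::call_once(initFlag_, [] { preallocate(); });
    }

    static std::size_t freeBlockCount() { return freeBlocks_.size(); }

private:
    static void preallocate() {
        freeBlocks_.reserve(STATIC_PREALLOCATION_SIZE);
        for (std::size_t i = 0; i < STATIC_PREALLOCATION_SIZE; ++i) {
            // 64-byte blocks chosen arbitrarily for the example;
            // leaked at process exit, which is fine for a sketch.
            freeBlocks_.push_back(new std::byte[64]);
        }
    }

    static inline std::once_flag initFlag_;
    static inline std::vector<std::byte*> freeBlocks_;
};
```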

2. Optimization of Thread-Local Cache

Efficient Caching Mechanism:

Introduced thread_local caches for each thread, significantly reducing contention for shared resources.

Configured the thread-local cache size to optimize memory usage while maintaining high throughput, with a default batch size of 128.

Prefetching for Performance: Implemented memory prefetching (PREFETCH) within caching logic to improve cache line utilization and reduce latency.
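A minimal sketch of the thread-local cache plus prefetch hint described in this section. The batch size of 128 is from the PR; the function and struct names are illustrative, and the prefetch uses the GCC/Clang `__builtin_prefetch` builtin as a stand-in for whatever `PREFETCH` expands to in the actual code.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t CACHE_BATCH_SIZE = 128;

struct ThreadCache {
    std::vector<void*> blocks;
    ThreadCache() { blocks.reserve(CACHE_BATCH_SIZE); }
};

// Each thread gets its own cache instance, so touching it never
// requires locking or atomic operations.
inline ThreadCache& localCache() {
    thread_local ThreadCache cache;
    return cache;
}

inline void* takeFromCache() {
    auto& c = localCache();
    if (c.blocks.empty()) return nullptr;
    void* p = c.blocks.back();
    c.blocks.pop_back();
#if defined(__GNUC__) || defined(__clang__)
    // Hint the CPU to pull the block's first cache line in before the
    // caller writes to it (write intent, high temporal locality).
    __builtin_prefetch(p, 1, 3);
#endif
    return p;
}
```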

3. Enhancements to Allocation Process

Tiered Allocation Logic:

Allocations prioritize the thread-local cache for speed.

If the local cache is empty, memory is fetched in batches from the lock-free shared list. As a fallback, dynamic allocation ensures availability.

Dynamic Growth: The try_grow method dynamically expands capacity when approaching allocation limits, ensuring scalability.
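The three-tier allocation path above (thread-local cache, batch refill from the lock-free shared list, dynamic fallback) can be sketched as below. This is an assumption-laden illustration: the shared list is modeled as a simple Treiber stack that ignores the ABA problem for brevity, and all names are invented for the example.

```cpp
#include <atomic>
#include <cstdlib>
#include <vector>

struct Node { Node* next; };

std::atomic<Node*> sharedHead{nullptr};
thread_local std::vector<Node*> localCache;
constexpr std::size_t BATCH = 128;

// Lock-free pop from the shared list (Treiber stack; ABA ignored here).
Node* sharedPop() {
    Node* head = sharedHead.load(std::memory_order_acquire);
    while (head && !sharedHead.compare_exchange_weak(
               head, head->next,
               std::memory_order_acquire, std::memory_order_relaxed)) {}
    return head;
}

void* allocate(std::size_t size) {
    // 1. Fast path: thread-local cache, no synchronization at all.
    if (!localCache.empty()) {
        Node* n = localCache.back();
        localCache.pop_back();
        return n;
    }
    // 2. Refill the local cache in a batch from the lock-free list.
    for (std::size_t i = 0; i < BATCH; ++i) {
        Node* n = sharedPop();
        if (!n) break;
        localCache.push_back(n);
    }
    if (!localCache.empty()) return allocate(size);
    // 3. Fallback: plain dynamic allocation guarantees availability.
    return std::malloc(size < sizeof(Node) ? sizeof(Node) : size);
}
```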

4. Improvements to Deallocation

Balanced Deallocation Strategy:

Memory is first returned to the thread-local cache.

If the cache is full, excess memory is flushed back to the lock-free shared list, maintaining a balance between local and global resources.

False Sharing Prevention: Cache-line alignment and proper struct padding minimize false sharing, further optimizing deallocation.
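Both points in this section can be sketched together: return to the thread-local cache first, flush a batch to the lock-free shared list when the cache is full, and align the shared head to a cache line. The flush-half policy, the 64-byte alignment value, and all names are assumptions made for the example; `std::hardware_destructive_interference_size` would be the portable constant for the alignment.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Node { Node* next; };

// alignas(64) keeps the hot atomic on its own cache line, so unrelated
// neighboring data cannot cause false sharing on it.
struct alignas(64) SharedList {
    std::atomic<Node*> head{nullptr};
    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(n->next, n,
                   std::memory_order_release, std::memory_order_relaxed)) {}
    }
};

SharedList shared;
thread_local std::vector<Node*> localCache;
constexpr std::size_t CACHE_LIMIT = 128;

void deallocate(void* p) {
    Node* n = static_cast<Node*>(p);
    // 1. Prefer the thread-local cache: fastest reuse path, no contention.
    if (localCache.size() < CACHE_LIMIT) {
        localCache.push_back(n);
        return;
    }
    // 2. Cache full: flush half of it back to the lock-free shared list
    //    so other threads can reuse the memory, then cache this block.
    for (std::size_t i = 0; i < CACHE_LIMIT / 2; ++i) {
        shared.push(localCache.back());
        localCache.pop_back();
    }
    localCache.push_back(n);
}
```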

5. Integration with Custom Memory Management

Polymorphic Allocators Support: Enabled integration with std::pmr::memory_resource for flexible custom memory management. This allows seamless use of LockfreeFreeList in modern memory resource-based systems.

Allocator Design: A custom LockfreePoolingAllocator was introduced to replace standard allocation mechanisms, offering finer control over memory operations.
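The `std::pmr::memory_resource` integration mentioned above can be sketched like this. The class name `PoolResource` is invented for the example, and the pool logic is stubbed out by delegating to the default resource; only the three virtual overrides are the real, standard `memory_resource` extension points.

```cpp
#include <cstddef>
#include <memory_resource>

// Sketch: exposing a pool through std::pmr::memory_resource lets any
// pmr-aware container or allocate_shared call draw from it.
class PoolResource : public std::pmr::memory_resource {
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        // A real implementation would consult the thread-local cache and
        // lock-free list first; we delegate for brevity.
        return std::pmr::get_default_resource()->allocate(bytes, alignment);
    }
    void do_deallocate(void* p, std::size_t bytes,
                       std::size_t alignment) override {
        std::pmr::get_default_resource()->deallocate(p, bytes, alignment);
    }
    bool do_is_equal(
        const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};
```

Any `std::pmr` container can then be pointed at the pool, e.g. `std::pmr::vector<int> v{&resource};`.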


Key Benefits of the New Implementation

1. Enhanced Performance in Multithreaded Environments

The lock-free design significantly reduces contention between threads. Thread-local caching ensures low-latency memory allocation and deallocation, critical for high-performance applications.

2. Precise Memory Management

The separation of allocation and deallocation logic between thread-local and global resources allows granular control over memory reuse, reducing fragmentation and improving predictability.

3. Dynamic Adaptability

The implementation scales dynamically with thread count and workload. Adjustments to preallocation size, batch size, and growth behavior ensure the system adapts to varying demands efficiently.

4. Flexibility for Future Extensions

This design provides a robust foundation for further enhancements. Future optimizations (e.g., adaptive batch sizes, priority-based allocation) can be easily incorporated without disrupting the core architecture.


Rationale for Replacing std::make_shared

1. Finer Control Over Memory

Unlike std::make_shared, which fuses the object and its control block (the reference counts) into a single allocation, the new system separates and optimizes these steps. This is crucial for scenarios demanding thread-local caching and preallocation.
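The distinction can be shown with the standard library alone: std::allocate_shared keeps make_shared's semantics but routes the fused allocation through a caller-supplied allocator, which is exactly the hook a pooling allocator needs. The function names below are invented for the illustration.

```cpp
#include <memory>
#include <memory_resource>

// One fused allocation for object + control block; no way to intercept
// where the memory comes from.
std::shared_ptr<int> makeFused() {
    return std::make_shared<int>(42);
}

// Same semantics, but every byte is requested from the supplied
// allocator, so a custom pool (such as a LockfreePoolingAllocator)
// can provide the storage.
std::shared_ptr<int> makeViaAllocator() {
    std::pmr::polymorphic_allocator<int> alloc{
        std::pmr::get_default_resource()};
    return std::allocate_shared<int>(alloc, 42);
}
```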

2. Thread-Specific Optimization

The use of thread-local caches ensures minimal contention and faster memory reuse, which std::make_shared cannot accommodate.

3. Improved Scalability

The lock-free shared list and dynamic growth capabilities enable efficient scaling in high-concurrency environments, outperforming the general-purpose design of std::make_shared.

4. Reduced Overhead

Granular memory management reduces memory fragmentation and overhead, offering predictable performance even under high load.

5. Customizability

The system supports advanced features such as prefetching, cache-line alignment, and integration with polymorphic memory resources, none of which std::make_shared exposes.


In summary, this implementation introduces a high-performance, scalable memory management solution tailored for multithreaded environments. It replaces std::make_shared to offer greater flexibility, precision, and efficiency in memory allocation and deallocation.


@jhogberg jhogberg mentioned this pull request Sep 12, 2024

This PR is stale because it has been open 45 days with no activity.

@github-actions github-actions bot added Stale No activity and removed Stale No activity labels Oct 19, 2024

@github-actions github-actions bot added the Stale No activity label Nov 30, 2024
@github-actions github-actions bot removed the Stale No activity label Dec 7, 2024
sonarqubecloud bot commented Jan 1, 2025
