Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit CPU and memory comsumption in resource-limited environment #3685

Open
v0y4g3r opened this issue Apr 10, 2024 · 3 comments
Open

Limit CPU and memory comsumption in resource-limited environment #3685

v0y4g3r opened this issue Apr 10, 2024 · 3 comments
Labels
C-feature Category Features C-performance Category Performance

Comments

@v0y4g3r
Copy link
Contributor

v0y4g3r commented Apr 10, 2024

What problem does the new feature solve?

GreptimeDB is designed to scale from even embedded devices to mega scale cloud services. But when it runs on resource-limited devices, like industrial controller based on.Android and Windows, it does not have a framework to limit the resource consumption, namely CPU and memory usage.

What does the feature do?

This issue calls for a resource-limit framework, just like cgroup in Linux kernel, to limit the CPU and memory usage for those dedicated spawned tasks, like flush, compaction, etc.

Implementation challenges

Tokio does not provide instrumentation tools to probe the CPU and memory usage of submitted tasks, we can only wrap the tasks with our own metrics and using rate limiting strategies to limit inflight tasks.

@v0y4g3r v0y4g3r added C-feature Category Features C-performance Category Performance Difficulty: Hard labels Apr 10, 2024
@sunng87
Copy link
Member

sunng87 commented Apr 11, 2024

This reminds me a discussion with @zyy17 about an general abstraction of task spawning:

  • an implementation independent spawn function as entrypoint
  • option to spawn to current thread, to thread pool, to local subprocess or remote process (managed by various resource manager, like k8s)
  • optional option to sepecify compute resources for the task

@ActivePeter
Copy link
Contributor

I have applied for OSPP 2024 with this sub project.
Here's the reference: https://summer-ospp.ac.cn/org/prodetail/2432c0077?lang=en&list=pro
Here's the real-time records: https://fvd360f8oos.feishu.cn/docx/QTb0d75gpoCoHGxL6Q3cx9KBnig

Task Goals and Technical Solutions

I. Task Basic Goals

  1. Encapsulate and modify common/runtime, achieve task priority, and submit a set of runtime libraries
    • Modify the common/runtime in Greptime (which is a layer of encapsulation of tokio and has three thread pools to handle different tasks) to achieve the function of task priority.
  2. Implement corresponding integrity and performance tests, do a good job in corresponding analysis, records, and source code that can be quickly reproduced
    • Ensure comprehensive testing of the modified runtime library, including functional integrity testing and performance testing. At the same time, detailed analysis and records should be made to be able to quickly reproduce the testing process and results.
  3. Try some further ideas (such as resource limitations, scheduling algorithms) and do a good job in corresponding summaries, records, and source code that can be quickly reproduced
    • Explore further optimization directions such as resource limitations and scheduling algorithms, and summarize and record the process and results of the attempt, and retain the source code that can be quickly reproduced.

II. Technical Solutions

(I) Selection of Pending Timing

  1. Probability calculation mode
    • Calculate directly with probability (it is more flexible to set, but there may be too many consecutive pendings).
  2. Fixed interval mode
    • Priority 5, probability 0.5: 1 0 1 0
    • Priority 4, probability 0.66: 11 0 11 0
    • Priority 3, probability 0.75: 111 0 111 0
    • Priority 2, probability 0.9 (10/11): 1111111111 0 1111111111 0
    • Normal (no special priority): 11111111111111
  3. Mode based on time accumulation and division
    • Record the accumulated time adding_up_time and pend_t. Divide the entire running process with pend_t, and the corresponding division point is the timing of pend.
    • When scheduling each time, see how many pend_t are contained in adding_up_time, and pend as many times (or only pend once). Then take the remainder of adding_up_time. This idea can more accurately control the frequency of triggering pend (how many times pend is triggered per unit time), and the period is pend_t.

(II) Pend Recovery Mechanism

  1. Prevent tasks from not being scheduled again
    • Returning pend directly may cause the awakened task to never be scheduled again in the future.
    • The idea adopted is to register the waker again before returning pend and wake again later (add to the scheduling queue).
    • The specific approach is to use a separate delayed wake-up task for wake-up.
    • tokio::task::yield_now() offered an example

(III) Control Closed-Loop CPU and Dynamically Adjust the Pending Probability

  1. Core idea
    • When the CPU usage of the thread pool exceeds the set threshold, the deviation value is reflected as the pending probability sensitivity coefficient.
  2. Observation of computational amount
    • https://docs.rs/tokio-metrics/latest/tokio_metrics/struct.TaskMetrics.html: This library has relatively complete observation of the delay related to task.
    • https://docs.rs/sysinfo/latest/sysinfo/: This library has the ability to observe resources across platforms.

@v0y4g3r
Copy link
Contributor Author

v0y4g3r commented Aug 5, 2024

@ActivePeter we can start with introducing a new wrapper runtime that can set different priorities for different tasks even if it's not yet referenced in the code base. Feel free to draft a PR once you're ready.

ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 28, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 28, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 28, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 28, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 28, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 28, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Sep 29, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Oct 20, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Oct 23, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Oct 23, 2024
ActivePeter added a commit to ActivePeter/greptimedb that referenced this issue Oct 23, 2024
github-merge-queue bot pushed a commit that referenced this issue Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature Category Features C-performance Category Performance
Projects
None yet
Development

No branches or pull requests

4 participants