Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(memory-manager): should not use prev_memory_stats.streaming_memory_usage #9384

Merged
merged 7 commits into from
Apr 24, 2023

Conversation

fuyufjh
Copy link
Member

@fuyufjh fuyufjh commented Apr 23, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Fix #9383.

This PR also reverts #8253 and #8475 but keeps the trait MemoryControlPolicy (renamed to "MemoryControl") for possible future usage.

The reason of that is, we are not actually ready for such "MemoryControlPolicy" right now. Back to that point when we wrote #8253 and #8475, we didn't realize that these policy cannot work at all without a better memory estimation.

After that, when #8827 merged, things become worse. Code of #8827 mixed up 2 things:

  • the machanism of measuring current memory usage
  • and the policy about how the memory should be used

... which made that part of code become quite error-prone. I think #9383 is one of its result.

(As you may see, I renamed StreamingOnlyPolicy to JemallocMemoryControl to emphasize it is a machanism of measuring rather than policy)

In my mind, This is a typical case of over-design or premature design. Thus, I strongly tend to remove these unused code and keep it simple and clear. We can add them back when memory estimation is ready, at which time we should get rid of JemallocMemoryControl and add back "StreamingOnlyPolicy" and "StreamingBatchPolicy".

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

My PR DOES NOT contain user-facing changes.

Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

@github-actions github-actions bot added the type/fix Bug fix label Apr 23, 2023
@codecov
Copy link

codecov bot commented Apr 23, 2023

Codecov Report

Merging #9384 (021d321) into main (8a494e5) will increase coverage by 0.03%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main    #9384      +/-   ##
==========================================
+ Coverage   70.79%   70.82%   +0.03%     
==========================================
  Files        1228     1228              
  Lines      203569   203456     -113     
==========================================
- Hits       144111   144100      -11     
+ Misses      59458    59356     -102     
Flag Coverage Δ
rust 70.82% <0.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/compute/src/lib.rs 1.53% <ø> (+0.10%) ⬆️
...rc/compute/src/memory_management/memory_manager.rs 0.00% <0.00%> (ø)
src/compute/src/memory_management/mod.rs 85.08% <0.00%> (+12.69%) ⬆️
src/compute/src/memory_management/policy.rs 0.00% <0.00%> (ø)
src/compute/src/server.rs 0.00% <0.00%> (ø)
src/risedevtool/src/service_config.rs 0.00% <ø> (ø)
src/risedevtool/src/task/compute_node_service.rs 0.00% <0.00%> (ø)

... and 5 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Comment on lines -65 to -67
.arg("--memory-control-policy")
.arg(&config.memory_control_policy)
.arg("--streaming-memory-proportion")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help to confirm these arguments are not used in cloud yet? @arkbriar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, neither the operator nor the cloud control plane use them.

@fuyufjh
Copy link
Member Author

fuyufjh commented Apr 24, 2023

Longevity test: https://buildkite.com/risingwave-test/longevity-kubebench/builds/272#0187ae32-d2b6-4b63-adbf-d42b1b610286

(Run on 3e39550. No actual changes except renaming and removing parameter since then)

Copy link
Contributor

@xx01cyx xx01cyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

Comment on lines +28 to 31
pub struct JemallocMemoryControl {
threshold_graceful: usize,
threshold_aggressive: usize,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether it's a good idea to put configuration values into a policy (sth. like concrete vs abstract).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emmmm I prefer this style a little bit, because these values belong to the policy's config in my mind, and besides, these values don't change in loops

@fuyufjh
Copy link
Member Author

fuyufjh commented Apr 24, 2023

The longevity can run correctly. I'll merge it then.

But sadly, the fluctuation is not completely solved.

After running NexMark (q4,q5,q7,q8) for 2 hours:

image

After running NexMark (q4,q5,q7,q8) for 15 hours:

image

@fuyufjh fuyufjh enabled auto-merge April 24, 2023 04:19
@fuyufjh fuyufjh added this pull request to the merge queue Apr 24, 2023
@st1page
Copy link
Contributor

st1page commented Apr 24, 2023

The longevity can run correctly. But sadly, the fluctuation is not completely solved.

This is not a big fluctuation as previously 🤔 And maybe the curve is not as smooth as we would like it to be under the preemptive memory allocation.

Merged via the queue into main with commit c33e1f2 Apr 24, 2023
@fuyufjh fuyufjh deleted the eric/fix_9383 branch April 24, 2023 04:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/fix Bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(memory-manager): should not use prev_memory_stats.streaming_memory_usage
4 participants