bug: limit 1 batch query caused OOM #8721

Closed · Tracked by #6640
fuyufjh opened this issue Mar 23, 2023 · 14 comments
@fuyufjh (Member) commented Mar 23, 2023

Describe the bug

In terms of timing, this issue seems to be related to the final results checking stage, which runs a batch query over the result MV:

Running command SELECT * FROM nexmark_q0 LIMIT 1

You can see this in the attached BuildKite log. Before 11:22, everything worked well; then the result check began, and we got several restarts.

I suspect the batch query caused a dramatic memory spike. Any ideas? By the way, why is the “batch query memory usage” metric always empty?

[screenshot]

Slack thread: https://risingwave-labs.slack.com/archives/C0423G2NUF8/p1679395706998939

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@fuyufjh added the type/bug label on Mar 23, 2023
@github-actions bot added this to the release-0.19 milestone on Mar 23, 2023
@fuyufjh (Member, Author) commented Mar 23, 2023

The 2023-03-22 longevity test (longnxkbkf-20230322-170646) failed in exactly the same way.

@fuyufjh (Member, Author) commented Mar 23, 2023

@liurenjie1024 PTAL and feel free to assign to others.

@liurenjie1024 (Contributor) commented:

After checking the recent failures, they are caused by the batch query, so let's close this for now.

@fuyufjh (Member, Author) commented Apr 26, 2023

Recurred in today's longevity test.

https://buildkite.com/risingwave-test/longevity-kubebench/builds/274

Every time the batch query (Running command SELECT * FROM nexmark_q14 LIMIT 1) failed, the restart count increased by 1.

[screenshot]

@fuyufjh reopened this on Apr 26, 2023
@fuyufjh (Member, Author) commented Apr 26, 2023

I think the correlation between the crashes and the batch query failures is quite clear:

2023-04-26 12:22:18 CST	Failed going for retry 0 out of 3
2023-04-26 12:52:06 CST	Failed going for retry 0 out of 3
2023-04-26 13:21:53 CST	Failed going for retry 0 out of 3
2023-04-26 13:51:55 CST	Failed going for retry 0 out of 3
2023-04-26 14:22:15 CST	Failed going for retry 0 out of 3
2023-04-26 14:50:01 CST	Failed going for retry 0 out of 3
2023-04-26 15:01:17 CST	Failed going for retry 0 out of 3

[screenshot]

@liurenjie1024 (Contributor) commented:

The meta cache size keeps growing:

[screenshot]

https://g-2927a1b4d9.grafana-workspace.us-east-1.amazonaws.com/d/EpkBw5W4k/risingwave-test-dashboard?orgId=1&var-namespace=longnxkbkf-20230425-140801&from=1682482800000&to=1682483040000&editPanel=92

Meanwhile, we only allocated about 300 MB to the meta cache:
https://rqa-logs.s3.ap-southeast-1.amazonaws.com/longevity/274_logs.txt
2023-04-26T07:14:13.554425Z INFO risingwave_compute::server: > total_memory: 13.00 GiB
2023-04-26T07:14:13.554428Z INFO risingwave_compute::server: > storage_memory: 3.12 GiB
2023-04-26T07:14:13.554431Z INFO risingwave_compute::server: > block_cache_capacity: 958.00 MiB
2023-04-26T07:14:13.554435Z INFO risingwave_compute::server: > meta_cache_capacity: 319.00 MiB
2023-04-26T07:14:13.554437Z INFO risingwave_compute::server: > shared_buffer_capacity: 1.56 GiB
2023-04-26T07:14:13.554441Z INFO risingwave_compute::server: > file_cache_total_buffer_capacity: 319.00 MiB
2023-04-26T07:14:13.554443Z INFO risingwave_compute::server: > compute_memory: 7.28 GiB
2023-04-26T07:14:13.554445Z INFO risingwave_compute::server: > reserved_memory: 2.60 GiB
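
For reference, here is a back-of-the-envelope check (not RisingWave's code, just arithmetic over the numbers logged above) showing how the 319 MiB meta cache fits into the 13 GiB total:

```rust
// Re-derives the compute node's startup log above; purely illustrative.
fn main() {
    // storage_memory = block_cache + meta_cache + shared_buffer + file_cache buffer
    let storage_mib = 958.0 + 319.0 + 1.56 * 1024.0 + 319.0; // ≈ 3193 MiB
    let storage_gib = storage_mib / 1024.0;                  // ≈ 3.12 GiB

    let total_gib = 13.00_f64;
    let reserved_gib = 2.60;
    let compute_gib = total_gib - storage_gib - reserved_gib; // ≈ 7.28 GiB

    println!("storage ≈ {storage_gib:.2} GiB, compute ≈ {compute_gib:.2} GiB");
}
```

So the meta cache is budgeted at only a few percent of total memory, yet the chart above shows it growing well past that budget.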

cc @hzxa21

@soundOfDestiny (Contributor) commented:

> meta_cache_capacity

It is inevitable because all the cache entries are in use.

@liurenjie1024 (Contributor) commented:

> It is inevitable because all the cache entries are in use.

Maybe we should block some operations before allocating memory?
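
For what it's worth, a minimal sketch of that idea (hypothetical, not RisingWave's actual design; it assumes tokio and made-up names) would be to guard the cache's byte budget with a semaphore, so an insert waits for budget instead of overshooting the configured capacity:

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Hypothetical byte-budget guard: one permit ≈ 1 KiB of cache budget.
struct CacheQuota {
    permits: Arc<Semaphore>,
}

impl CacheQuota {
    fn new(capacity_bytes: usize) -> Self {
        Self { permits: Arc::new(Semaphore::new(capacity_bytes / 1024)) }
    }

    /// Waits until `bytes` of budget is free, then returns a guard that
    /// releases the budget when dropped (i.e., when the entry is evicted).
    async fn reserve(&self, bytes: usize) -> OwnedSemaphorePermit {
        let permits = ((bytes + 1023) / 1024) as u32;
        self.permits
            .clone()
            .acquire_many_owned(permits)
            .await
            .expect("quota closed")
    }
}

#[tokio::main]
async fn main() {
    let quota = CacheQuota::new(319 * 1024 * 1024); // 319 MiB, as in the startup log
    let _guard = quota.reserve(64 * 1024).await;    // blocks if the budget is exhausted
    // ... insert the 64 KiB block into the cache while holding the guard ...
}
```

Whether blocking is acceptable depends on who calls into the cache; this only illustrates the shape of the idea, not a concrete proposal.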

@soundOfDestiny (Contributor) commented Apr 27, 2023

> It is inevitable because all the cache entries are in use.
>
> Maybe we should block some operations before allocating memory?

@Little-Wallace has already found a solution. We can wait for his PR.

@liurenjie1024 (Contributor) commented:

Any update? cc @soundOfDestiny

@soundOfDestiny (Contributor) commented:

> Any update? cc @soundOfDestiny

#9517

@soundOfDestiny (Contributor) commented May 5, 2023

FYI, #9517 is merged.

@fuyufjh closed this as completed on May 6, 2023
@liurenjie1024 (Contributor) commented:

Let's keep this open for a while; we still haven't enabled the LIMIT 1 query in the longevity test to verify the fix.
