
nexmark q5 long running OOM due to unawareness of container memory limit #6615

Closed
Tracked by #6640
KeXiangWang opened this issue Nov 28, 2022 · 7 comments
Assignees
Labels
type/bug Something isn't working

Comments

@KeXiangWang
Contributor

Describe the bug

When running nexmark q5, the RisingWave compute node OOMs after about 30 minutes.

To Reproduce

The bug emerges in an EKS environment, but should also occur locally.
Use nexmark-bench to generate data through Kafka.

Expected behavior

No response

Additional context

Based on metrics, the OOM happens on both of the two HashAgg fragments (max and count).

Here's nexmark q5:

    CREATE MATERIALIZED VIEW nexmark_q5
    AS
    SELECT AuctionBids.auction,
           AuctionBids.num
    FROM (SELECT bid.auction,
                 count(*)     AS num,
                 window_start AS starttime
          FROM HOP(bid, date_time, INTERVAL '2' SECOND, INTERVAL '10' SECOND)
          GROUP BY window_start,
                   bid.auction) AS AuctionBids
    JOIN (SELECT max(CountBids.num) AS maxn,
                 CountBids.starttime_c
          FROM (SELECT count(*)     AS num,
                       window_start AS starttime_c
                FROM HOP(bid, date_time, INTERVAL '2' SECOND, INTERVAL '10' SECOND)
                GROUP BY bid.auction,
                         window_start) AS CountBids
          GROUP BY CountBids.starttime_c) AS MaxBids
    ON AuctionBids.starttime = MaxBids.starttime_c AND
       AuctionBids.num >= MaxBids.maxn;
@KeXiangWang KeXiangWang added the type/bug Something isn't working label Nov 28, 2022
@github-actions github-actions bot added this to the release-0.1.15 milestone Nov 28, 2022
@BugenZhao BugenZhao self-assigned this Nov 28, 2022
@fuyufjh
Member

fuyufjh commented Nov 28, 2022

Hi, does anyone have ideas about the cause?

@BugenZhao
Member

Hi, does anyone have ideas about the cause?

I'll help to investigate it. 👀

@KeXiangWang
Contributor Author

KeXiangWang commented Nov 29, 2022

One possible reason. Here's the code we use to get the memory size:

        // sysinfo reports the host (VM) total memory, not the cgroup/pod limit
        let mut sys = System::new();
        sys.refresh_memory();
        sys.total_memory() as usize

In the EKS environment, the compute node is encapsulated in a K8s pod. One VM (EC2 instance) can host multiple pods, and resources are limited at the pod level, but this API fetches the VM's system information.
For example, suppose a VM has 16 GB of memory and hosts 4 pods, each with a 4 GB resource limit.
The code above reports 16 GB.
LRU eviction is driven by this memory size: since the measured footprint stays far below 16 GB, the LRU cache may never evict, even though the actual footprint has already exceeded the 4 GB pod limit.
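A minimal sketch of a container-aware alternative: read the pod's memory limit from the cgroup filesystem instead of asking the host. The paths, parsing rules, and fallback behavior below are assumptions for illustration, not RisingWave's actual implementation.

```rust
use std::fs;

/// Parse a cgroup memory-limit file. cgroup v2 writes "max" when no limit
/// is set; cgroup v1 writes a huge sentinel value instead, which we also
/// treat as "no limit". (Hypothetical sketch, not RisingWave's code.)
fn parse_cgroup_limit(content: &str) -> Option<usize> {
    let trimmed = content.trim();
    if trimmed == "max" {
        return None; // cgroup v2: unlimited
    }
    let bytes = trimmed.parse::<usize>().ok()?;
    // cgroup v1 reports roughly i64::MAX (page-aligned) when unlimited.
    if bytes >= 1 << 60 {
        return None;
    }
    Some(bytes)
}

/// Try the cgroup v2 path first, then the v1 path. Returns None when no
/// container memory limit applies; the caller would then fall back to the
/// host total reported by sysinfo.
fn container_memory_limit() -> Option<usize> {
    ["/sys/fs/cgroup/memory.max",
     "/sys/fs/cgroup/memory/memory.limit_in_bytes"]
        .iter()
        .find_map(|path| fs::read_to_string(path).ok())
        .and_then(|content| parse_cgroup_limit(&content))
}

fn main() {
    // A 4 GB pod limit as it would appear in the cgroup file.
    assert_eq!(parse_cgroup_limit("4294967296\n"), Some(4 << 30));
    assert_eq!(parse_cgroup_limit("max\n"), None);
    println!("container limit: {:?}", container_memory_limit());
}
```

With this, the 4 GB pod in the example above would be detected as 4 GB rather than the VM's 16 GB, so LRU eviction could kick in before the pod is OOM-killed.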

@KeXiangWang
Contributor Author

Same issue: #6536

@BugenZhao
Member

Is this the same as #6536?

@lmatz
Contributor

lmatz commented Nov 29, 2022

Related: #6536

Some ideas:

  1. The LRU manager should strictly keep memory usage under X GB.
  2. Cloud should configure all_memory_available_bytes correctly, i.e. 4 GB in this case.
  3. Ideally, the kernel can also check (being container-aware) whether the configured parameter makes sense, and log an ERROR/WARN message if not, or refuse to start (?), since OOM is possible otherwise.
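Idea 1 can be sketched as a simple watermark policy: compare the current footprint against a hard limit and evict down to some fraction of it. All names and the 80% watermark are hypothetical, not RisingWave's actual LRU manager.

```rust
/// Hypothetical sketch of idea 1: an LRU manager that computes how many
/// bytes to evict to get back under a watermark (a fraction of the limit).
struct LruPolicy {
    memory_limit: usize,  // e.g. the 4 GB pod limit, not the 16 GB VM total
    watermark_ratio: f64, // start evicting above this fraction of the limit
}

impl LruPolicy {
    fn bytes_to_evict(&self, current_usage: usize) -> usize {
        let watermark = (self.memory_limit as f64 * self.watermark_ratio) as usize;
        // Evict nothing below the watermark; otherwise evict the excess.
        current_usage.saturating_sub(watermark)
    }
}

fn main() {
    let policy = LruPolicy { memory_limit: 4 << 30, watermark_ratio: 0.8 };
    // Under the watermark: nothing to evict.
    assert_eq!(policy.bytes_to_evict(3 << 30), 0);
    // Over the limit: evict back down toward the 80% watermark.
    assert!(policy.bytes_to_evict(5 << 30) > 0);
    println!("ok");
}
```

The key point is that `memory_limit` must be the pod's limit (idea 2); with the VM total plugged in, `bytes_to_evict` stays at zero while the pod is already over budget.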

@BugenZhao BugenZhao changed the title nexmark q5 long running OOM nexmark q5 long running OOM due to unawareness of container memory limit Dec 19, 2022
@fuyufjh fuyufjh assigned sixletters and unassigned BugenZhao Dec 19, 2022
@lmatz
Contributor

lmatz commented Dec 23, 2022

This should be fixed by #6536; closing it now.

If it happens again, feel free to reopen or create another issue, thanks!

@lmatz lmatz closed this as completed Dec 23, 2022
5 participants