
nexmark q5 long running OOM due to unawareness of container memory limit #6615

Closed
Tracked by #6640
KeXiangWang opened this issue Nov 28, 2022 · 7 comments
Assignees
Labels
type/bug Something isn't working

Comments

@KeXiangWang
Contributor

Describe the bug

When running nexmark q5, the RisingWave compute node OOMs after about 30 minutes.

To Reproduce

The bug emerges in an EKS environment, but should also occur locally.
Use nexmark-bench to generate data through Kafka.

Expected behavior

No response

Additional context

Based on metrics, the OOM happens on both of the two HashAgg fragments (max and count).

Here's nexmark q5:

    CREATE MATERIALIZED VIEW nexmark_q5
    AS
    SELECT AuctionBids.auction,
           AuctionBids.num
    FROM (SELECT bid.auction,
                 count(*)     AS num,
                 window_start AS starttime
          FROM HOP(bid, date_time, INTERVAL '2' SECOND, INTERVAL '10' SECOND)
          GROUP BY window_start,
                   bid.auction) AS AuctionBids
    JOIN (SELECT max(CountBids.num) AS maxn,
                 CountBids.starttime_c
          FROM (SELECT count(*)     AS num,
                       window_start AS starttime_c
                FROM HOP(bid, date_time, INTERVAL '2' SECOND, INTERVAL '10' SECOND)
                GROUP BY bid.auction,
                         window_start) AS CountBids
          GROUP BY CountBids.starttime_c) AS MaxBids
    ON AuctionBids.starttime = MaxBids.starttime_c AND
       AuctionBids.num >= MaxBids.maxn;
@KeXiangWang KeXiangWang added the type/bug Something isn't working label Nov 28, 2022
@github-actions github-actions bot added this to the release-0.1.15 milestone Nov 28, 2022
@BugenZhao BugenZhao self-assigned this Nov 28, 2022
@fuyufjh
Member

fuyufjh commented Nov 28, 2022

Hi, does anyone have ideas about the cause?

@BugenZhao
Member

Hi, does anyone have ideas about the cause?

I'll help to investigate it. 👀

@KeXiangWang
Contributor Author

KeXiangWang commented Nov 29, 2022

One possible reason. Here's the code we use to get the memory size:

        // sysinfo reports the host (VM) total memory, not the cgroup/pod limit
        let mut sys = System::new();
        sys.refresh_memory();
        sys.total_memory() as usize

In the EKS environment, the compute node is encapsulated in a K8s pod. One VM (EC2 instance) can host multiple pods, and resources are limited at the pod level, but this API fetches the VM's system information.
For example, suppose a VM has 16 GB of memory and hosts 4 pods, each with a 4 GB resource limit.
The code above reports 16 GB.
LRU eviction is driven by this memory size: since the measured footprint stays far below 16 GB, the LRU cache may never evict, even though the actual footprint has already exceeded the 4 GB pod limit.
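A minimal sketch of a container-aware alternative: read the pod's memory limit from the cgroup filesystem instead of asking the host. The paths, parsing rules, and fallback behavior below are assumptions for illustration, not RisingWave's actual implementation.

```rust
use std::fs;

/// Parse a cgroup memory-limit file. cgroup v2 writes "max" when no limit
/// is set; cgroup v1 writes a huge sentinel value instead, which we also
/// treat as "no limit". (Hypothetical sketch, not RisingWave's code.)
fn parse_cgroup_limit(content: &str) -> Option<usize> {
    let trimmed = content.trim();
    if trimmed == "max" {
        return None; // cgroup v2: unlimited
    }
    let bytes = trimmed.parse::<usize>().ok()?;
    // cgroup v1 reports roughly i64::MAX (page-aligned) when unlimited.
    if bytes >= 1 << 60 {
        return None;
    }
    Some(bytes)
}

/// Try the cgroup v2 path first, then the v1 path. Returns None when no
/// container memory limit applies; the caller would then fall back to the
/// host total reported by sysinfo.
fn container_memory_limit() -> Option<usize> {
    ["/sys/fs/cgroup/memory.max",
     "/sys/fs/cgroup/memory/memory.limit_in_bytes"]
        .iter()
        .find_map(|path| fs::read_to_string(path).ok())
        .and_then(|content| parse_cgroup_limit(&content))
}

fn main() {
    // A 4 GB pod limit as it would appear in the cgroup file.
    assert_eq!(parse_cgroup_limit("4294967296\n"), Some(4 << 30));
    assert_eq!(parse_cgroup_limit("max\n"), None);
    println!("container limit: {:?}", container_memory_limit());
}
```

With this, the 4 GB pod in the example above would be detected as 4 GB rather than the VM's 16 GB, so LRU eviction could kick in before the pod is OOM-killed.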

@KeXiangWang
Contributor Author

Same issue: #6536

@BugenZhao
Member

Is this the same as #6536?

@lmatz
Contributor

lmatz commented Nov 29, 2022

Related: #6536

Some ideas:

  1. The LRU manager should strictly keep memory usage under X GB.
  2. Cloud should configure all_memory_available_bytes correctly, i.e. 4 GB in this case.
  3. Ideally, the kernel can also check (being container-aware) whether the configured parameter makes sense, and log an ERROR/WARN message if not, or refuse to start (?), since OOM is possible otherwise.
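Idea 1 can be sketched as a simple watermark policy: compare the current footprint against a hard limit and evict down to some fraction of it. All names and the 80% watermark are hypothetical, not RisingWave's actual LRU manager.

```rust
/// Hypothetical sketch of idea 1: an LRU manager that computes how many
/// bytes to evict to get back under a watermark (a fraction of the limit).
struct LruPolicy {
    memory_limit: usize,  // e.g. the 4 GB pod limit, not the 16 GB VM total
    watermark_ratio: f64, // start evicting above this fraction of the limit
}

impl LruPolicy {
    fn bytes_to_evict(&self, current_usage: usize) -> usize {
        let watermark = (self.memory_limit as f64 * self.watermark_ratio) as usize;
        // Evict nothing below the watermark; otherwise evict the excess.
        current_usage.saturating_sub(watermark)
    }
}

fn main() {
    let policy = LruPolicy { memory_limit: 4 << 30, watermark_ratio: 0.8 };
    // Under the watermark: nothing to evict.
    assert_eq!(policy.bytes_to_evict(3 << 30), 0);
    // Over the limit: evict back down toward the 80% watermark.
    assert!(policy.bytes_to_evict(5 << 30) > 0);
    println!("ok");
}
```

The key point is that `memory_limit` must be the pod's limit (idea 2); with the VM total plugged in, `bytes_to_evict` stays at zero while the pod is already over budget.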

@BugenZhao BugenZhao changed the title nexmark q5 long running OOM nexmark q5 long running OOM due to unawareness of container memory limit Dec 19, 2022
@fuyufjh fuyufjh assigned sixletters and unassigned BugenZhao Dec 19, 2022
@lmatz
Contributor

lmatz commented Dec 23, 2022

This should be fixed by #6536; closing it now.

If it happens again, feel free to reopen or create another issue, thanks!

@lmatz lmatz closed this as completed Dec 23, 2022
5 participants