batch task memory leak #7696

Closed
Tracked by #6640
xxchan opened this issue Feb 3, 2023 · 12 comments
Labels: priority/high, type/bug (Something isn't working)

Comments

xxchan (Member) commented Feb 3, 2023

Describe the bug

I ran the following script:

import psycopg2

conn = psycopg2.connect("dbname=dev user=root port=4566 host=localhost")
cur = conn.cursor()
cur.execute(
    "drop table if exists t ;create table t (x int);"
)

for i in range(1000000):
    cur.execute(f"INSERT INTO t values ({i});")

I observed that memory usage kept increasing. It seems the memory used by batch tasks is never freed. Is this normal?

[screenshot: memory usage steadily increasing while the inserts run]

To Reproduce

No response

Expected behavior

No response

Additional context

Found together with #7694, but they seem to be different issues?

xxchan added the type/bug (Something isn't working) label on Feb 3, 2023
github-actions bot added this to the release-0.1.17 milestone on Feb 3, 2023
BugenZhao (Member) commented Feb 6, 2023

If it's caused by the spawned task not being freed, then it's reasonable for this to happen along with #7694: since the task is never freed, the memory it holds is never released either.

xxchan (Member, Author) commented Feb 6, 2023

Here's another test, and I'm not sure how to interpret it 🥵 😇

The first hill is from inserting 5,000,000 rows. After a while the memory usage dropped to 480 MB. Then I inserted another 5,000,000 rows; this time the memory usage didn't drop. Then I inserted again and hit OOM.

[screenshot: memory usage graph showing the two insert rounds and the final OOM]

I tested using the following script:

create table t (id int primary key,uid int,v1 int,v2 float,s1 varchar,s2 varchar,update_time timestamp);

(c1, c2) = (0, 5000000) for the first run and (5000000, 10000000) for the second run.

import psycopg2
import random
import datetime

class InsertValue(object):
    def __init__(self):
        self.conn = psycopg2.connect(
            database="dev", user="root", host="127.0.0.1", port=4566
        )

    def parse(self, c1, c2):
        # Insert rows with ids in [c1, c2), batching 10 rows per INSERT statement.
        cursor = self.conn.cursor()
        for step in range(c1, c2, 10):
            li = []
            for j in range(step, step + 10):
                v2 = random.uniform(1, 1000)
                update_time = datetime.datetime.now()
                vv = """({},{},{},{},'test1','test2','{}')""".format(
                    j, j, j, v2, update_time
                )
                li.append(vv)
            v = ",".join(li)
            sql = """insert into t values {};""".format(v)
            cursor.execute(sql)
        self.conn.commit()
        cursor.close()

    def run(self, c1, c2):
        self.parse(c1, c2)
        self.conn.close()


if __name__ == "__main__":
    iv = InsertValue()
    iv.run(0, 5000000)

fuyufjh (Member) commented Feb 7, 2023

xx01cyx (Contributor) commented Feb 7, 2023

This is caused by BatchTaskExecution not being freed after query execution. Each batch task is stored in a hashmap in BatchManager and never gets removed.
https://github.com/risingwavelabs/risingwave/blob/main/src/batch/src/task/task_manager.rs#L96-L97
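
For illustration, here is a minimal Rust sketch of that pattern (simplified, hypothetical names; not the actual code behind the link above): the task is inserted into the manager's map when it is fired, and nothing ever removes it, so every finished query keeps its task pinned in memory.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct BatchTaskExecution {
    // operator state, output buffers, ... (elided)
}

struct BatchManager {
    tasks: Mutex<HashMap<String, Arc<BatchTaskExecution>>>,
}

impl BatchManager {
    fn fire_task(&self, task_id: String) {
        let task = Arc::new(BatchTaskExecution {});
        // The entry is inserted here, but there is no matching `remove`
        // after the query finishes, so the map only grows.
        self.tasks.lock().unwrap().insert(task_id, task);
    }
}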

fuyufjh (Member) commented Feb 7, 2023

We seem to have fixed a similar problem 🤔 cc. @BowenXiao1999 @liurenjie1024

xx01cyx (Contributor) commented Feb 7, 2023

We seem to have fixed a similar problem 🤔 cc. @BowenXiao1999 @liurenjie1024

IIRC that was about query execution, which increases memory usage in the frontend: #5827

BugenZhao (Member) commented:

Seems like another WeakHashMap-like issue... 🤔
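
A "WeakHashMap-like" fix here would mean the manager's map holds only weak references, so the map alone can't keep a task alive and stale entries can be pruned. A minimal Rust sketch of the idea (hypothetical names, not RisingWave's actual BatchManager):

use std::collections::HashMap;
use std::sync::{Arc, Weak};

struct BatchTaskExecution;

struct BatchManager {
    // Weak references: the map does not own the tasks, so dropping the last
    // Arc elsewhere frees the task even if its entry is still in the map.
    tasks: HashMap<String, Weak<BatchTaskExecution>>,
}

impl BatchManager {
    fn get_task(&self, task_id: &str) -> Option<Arc<BatchTaskExecution>> {
        self.tasks.get(task_id).and_then(|w| w.upgrade())
    }

    fn prune(&mut self) {
        // Drop entries whose task has already been deallocated.
        self.tasks.retain(|_, w| w.strong_count() > 0);
    }
}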

BowenXiao1999 (Contributor) commented Feb 7, 2023

Yes, there will also be a problem with the hashmap's reserved memory.

We might need a background coroutine to do the clean-up. A WeakHashMap or other naive approach may not help: the task's upstream may keep reading from it (specifically, it gets the output receiver from the task), and we don't know when it is safe to delete the task. We can't delete it immediately once the task finishes, because the upstream may not have started yet, and the task's output channel would then be dropped too early. So we probably need some kind of status flag.
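
A rough sketch of that idea (hypothetical names, assuming a tokio runtime; not RisingWave's actual API): each entry carries status flags, and a background coroutine periodically removes entries only once the task has finished and its output has been taken by the upstream.

use std::collections::HashMap;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;
use tokio::sync::Mutex;

struct TaskEntry {
    finished: AtomicBool,     // set when the task's execution completes
    output_taken: AtomicBool, // set once the upstream has taken the output receiver
    // ... output channel, join handle, etc. (elided)
}

type TaskMap = Arc<Mutex<HashMap<String, Arc<TaskEntry>>>>;

fn spawn_cleanup(tasks: TaskMap) {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(10));
        loop {
            ticker.tick().await;
            let mut map = tasks.lock().await;
            // Only drop entries that are both finished and already consumed,
            // so a late-starting upstream never finds its channel gone.
            map.retain(|_, t| {
                !(t.finished.load(Ordering::Acquire) && t.output_taken.load(Ordering::Acquire))
            });
            // Also release the map's reserved capacity.
            map.shrink_to_fit();
        }
    });
}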

liurenjie1024 (Contributor) commented:

Yes, this is a known issue in distributed mode, and we need a fix later.

st1page assigned liurenjie1024 and unassigned st1page on Feb 7, 2023
xxchan (Member, Author) commented Feb 8, 2023

BTW, should we have a test for such problems (memory leaks after a lot of batch tasks), e.g. in the longevity test?

liurenjie1024 (Contributor) commented:

BTW, should we have a test for such problems (memory leaks after a lot of batch tasks), e.g. in the longevity test?

Good idea, but currently the test team has no resources for longevity testing of distributed queries, so we can postpone it.

liurenjie1024 (Contributor) commented:

Fixed.
