decrease memory usage when csv&gzip is on #212

Merged: 1 commit, Jul 1, 2024

zhaorongsheng
Contributor

Proposed changes

Issue Number: close #211

Problem Summary:

Use a row input stream instead of a StringBuilder when writing to the gzip output stream.
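
As a rough illustration of the idea in this summary, here is a minimal sketch of a stream that serves rows one at a time. `RowInputStream` is a hypothetical class written for this example and is not necessarily what the PR adds; it assumes rows arrive as already-formatted CSV strings and are joined with a line delimiter. Only the current row is held as bytes, rather than the whole batch.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical sketch: exposes an iterator of CSV rows as an InputStream. */
class RowInputStream extends InputStream {
    private final Iterator<String> rows;   // already-formatted CSV lines (assumption)
    private final byte[] delimiter;        // line delimiter, written between rows
    private byte[] current = new byte[0];  // bytes of the row currently being served
    private int pos = 0;                   // read position inside `current`
    private boolean first = true;          // no delimiter before the first row

    RowInputStream(Iterator<String> rows, String lineDelimiter) {
        this.rows = rows;
        this.delimiter = lineDelimiter.getBytes(StandardCharsets.UTF_8);
    }

    /** Loads the next row when the current one is consumed; false at end of input. */
    private boolean fill() {
        while (pos >= current.length) {
            if (!rows.hasNext()) {
                return false;
            }
            byte[] row = rows.next().getBytes(StandardCharsets.UTF_8);
            if (first) {
                current = row;
                first = false;
            } else {
                byte[] next = new byte[delimiter.length + row.length];
                System.arraycopy(delimiter, 0, next, 0, delimiter.length);
                System.arraycopy(row, 0, next, delimiter.length, row.length);
                current = next;
            }
            pos = 0;
        }
        return true;
    }

    @Override
    public int read() {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] buf, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, buf, off, n);
        pos += n;
        return n;
    }

    public static void main(String[] args) {
        RowInputStream in = new RowInputStream(Arrays.asList("1,a", "2,b").iterator(), "\n");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4]; // deliberately tiny to show chunked reads
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);
        }
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

A consumer can read such a stream in small chunks, so peak memory is tied to the longest row plus the read buffer, not to the batch size.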

Checklist(Required)

  1. Does it affect the original behavior: (No)
  2. Has unit tests been added: (No Need)
  3. Has document been added or modified: (No Need)
  4. Does it need to update dependencies: (No)
  5. Are there any changes that cannot be rolled back: (No)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

Contributor

gnehil commented Jun 24, 2024

Have you done any tests? If so, can you post the results?

Contributor Author

zhaorongsheng commented Jun 27, 2024

> Have you done any tests? If so, can you post the results?

@gnehil We did not add a test case for this change. In our scenario, however, the Spark executor ran out of memory (OOM) with the original code, and the same job ran successfully with the optimized code on the same data and configuration.

.format(format)
.sep(columnSeparator)
.delim(lineDelimiter)
.schema(schema)
.addDoubleQuotes(addDoubleQuotes).build, streamingPassthrough)
val content = recordBatchString.getContent
Contributor


My guess is that formatting the entire batch of data into a single String object takes up extra memory, whereas copying between streams reduces memory usage to the size of one read buffer.
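
A minimal sketch of the stream-to-stream copy described in this comment, using only JDK classes. The class and method names (`GzipStreamCopy`, `gzipCopy`, `gunzip`) are made up for illustration and are not taken from the connector; the point is that the only working memory the copy loop allocates is one fixed-size buffer, however large the input is.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: gzip an input stream without materializing it as a String. */
class GzipStreamCopy {

    /** Compresses `in` into `out` through a fixed-size buffer; returns the uncompressed byte count. */
    static long gzipCopy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buf = new byte[bufferSize];
        long total = 0;
        // Closing the GZIPOutputStream writes the gzip trailer and also closes `out`.
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                gzip.write(buf, 0, n);
                total += n;
            }
        }
        return total;
    }

    /** Decompresses a gzip payload; used here only to verify the round trip. */
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "1,a\n2,b\n3,c".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = gzipCopy(new ByteArrayInputStream(csv), sink, 4);
        System.out.println(copied + " bytes read, " + sink.size() + " bytes written");
    }
}
```

By contrast, building the whole batch as a String first keeps the full text in memory alongside its encoded bytes before compression starts, which would be consistent with the doubled executor memory reported in the linked issue.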

Contributor

gnehil commented Jun 27, 2024

LGTM

Member

JNSimba left a comment


LGTM

JNSimba merged commit 3e745e7 into apache:master on Jul 1, 2024
6 checks passed
Development

Successfully merging this pull request may close these issues.

[Enhancement] The executor memory usage will be double when write to doris with csv&gz