I would like to write huge trees without retaining the entire tree in memory #1031

Open
skinkie opened this issue May 7, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@skinkie
Contributor

skinkie commented May 7, 2024

Ideally, I would like to write out a tree where the data is added just in time. The proposal in #1030 shows increasing memory usage, which suggests that the tree is still being built completely in memory. I wanted to add some evidence. Please ignore the timing.

Using the generator method:
[Image: mem-graph-generator, memory usage graph for the generator method]

Materializing into a list first:
[Image: mem-graph, memory usage graph when materializing into a list first]

Ideally, memory consumption wouldn't increase at all, and the data would just be written out as it is provided. But I guess the graphs give a clear view of where we can make improvements when writing out huge documents.
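To make the "written out as it is provided" idea concrete, here is a minimal standard-library sketch. It is not xsdata code: `generate_items` and `stream_items` are hypothetical names, and `xml.sax.saxutils.XMLGenerator` stands in for whatever event-based writer is used. Each item is written the moment the generator yields it, so nothing beyond the current item needs to stay in memory.

```python
import io
from typing import Iterable, Iterator
from xml.sax.saxutils import XMLGenerator


def generate_items(count: int) -> Iterator[str]:
    """Yield item payloads one at a time (hypothetical data source)."""
    for index in range(count):
        yield f"item-{index}"


def stream_items(out: io.TextIOBase, items: Iterable[str]) -> int:
    """Write each item as soon as it is produced; return how many were written."""
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startDocument()
    writer.startElement("items", {})
    written = 0
    for text in items:
        # Only the current item is alive here; earlier ones are already written.
        writer.startElement("item", {})
        writer.characters(text)
        writer.endElement("item")
        written += 1
    writer.endElement("items")
    writer.endDocument()
    return written


buffer = io.StringIO()
count = stream_items(buffer, generate_items(3))
```

In a real run `out` would be an open file rather than a `StringIO`, so the output would not accumulate in memory either.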

@tefra tefra added the enhancement New feature or request label May 7, 2024
@tefra
Owner

tefra commented May 7, 2024

We need to fully support the Iterable type annotation for infinite generators in the data models, and the serializers.
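As a rough illustration of what an `Iterable` annotation on a data model implies, the sketch below uses plain dataclasses. `Stop`, `Timetable`, and `stops_from_source` are hypothetical names, and this does not show how xsdata itself declares or inspects fields. The point is that a generator satisfies `Iterable[...]` for a type checker, but can only be consumed once.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Stop:
    name: str


@dataclass
class Timetable:
    # Iterable (rather than List) lets a type checker accept a generator here.
    stops: Iterable[Stop] = field(default_factory=list)


def stops_from_source(count: int) -> Iterator[Stop]:
    """Produce stops lazily (hypothetical data source)."""
    for index in range(count):
        yield Stop(name=f"stop-{index}")


timetable = Timetable(stops=stops_from_source(3))

# First pass consumes the generator, one element at a time.
names = [stop.name for stop in timetable.stops]

# A generator is single-use: a second pass finds nothing left.
remaining = list(timetable.stops)
```

The single-use behaviour is why a serializer that wants to support generators cannot call `len()` on the field or iterate over it twice.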

The pr is a good first attempt @skinkie but it needs some more work

@skinkie
Contributor Author

skinkie commented May 10, 2024

Writing a 3.4 GB file using generators takes ~12 GB of memory with LxmlEventWriter. XmlEventWriter takes virtually no memory while writing to disk, and it writes in a streaming fashion. I think this must be investigated, especially if LxmlEventWriter is the default. I rewrote my whole project to split things up because I was under the impression I couldn't get it stored in memory.
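The gap described here is what one would expect if one writer builds a complete tree before serializing while the other emits events directly. The sketch below reproduces that contrast with the standard library only. It does not measure xsdata or lxml: `xml.etree.ElementTree` stands in for a tree-building writer, `XMLGenerator` for an event-streaming one, and `NullSink`, `rows`, and `peak_bytes` are hypothetical helpers. The absolute numbers are illustrative.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET
from typing import Callable, Iterator
from xml.sax.saxutils import XMLGenerator


class NullSink(io.TextIOBase):
    """Text sink that discards output, so only writer memory is measured."""

    def write(self, text: str) -> int:
        return len(text)


def rows(count: int) -> Iterator[str]:
    for index in range(count):
        yield f"value-{index}"


def write_via_tree(count: int) -> None:
    """Build the whole tree first, then serialize it."""
    root = ET.Element("rows")
    for text in rows(count):
        ET.SubElement(root, "row").text = text
    ET.ElementTree(root).write(NullSink(), encoding="unicode")


def write_via_events(count: int) -> None:
    """Emit each row as an event; nothing is kept after it is written."""
    writer = XMLGenerator(NullSink(), encoding="utf-8")
    writer.startDocument()
    writer.startElement("rows", {})
    for text in rows(count):
        writer.startElement("row", {})
        writer.characters(text)
        writer.endElement("row")
    writer.endElement("rows")
    writer.endDocument()


def peak_bytes(func: Callable[[int], None], count: int) -> int:
    """Return the peak traced allocation, in bytes, while func runs."""
    tracemalloc.start()
    try:
        func(count)
        return tracemalloc.get_traced_memory()[1]
    finally:
        tracemalloc.stop()


tree_peak = peak_bytes(write_via_tree, 20_000)
event_peak = peak_bytes(write_via_events, 20_000)
print(f"tree: {tree_peak} bytes, events: {event_peak} bytes")
```

The tree variant's peak grows with the number of rows even though the input is a generator, because every element stays alive until serialization finishes.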

@skinkie
Contributor Author

skinkie commented May 11, 2024

@tefra it is mentioned that there are alternatives, but the characteristics of the two are not described.

@tefra
Owner

tefra commented Oct 20, 2024

Hey @skinkie give this pr #1082 a try.

@skinkie
Contributor Author

skinkie commented Oct 20, 2024

> Hey @skinkie give this pr #1082 a try.

Is there a reason that #1082 would specifically address this issue? Obviously gonna test the Iterable stuff ;-)

@tefra
Owner

tefra commented Oct 20, 2024

It will allow you to use generators without mypy errors. What else did you have in mind?

@skinkie
Contributor Author

skinkie commented Oct 20, 2024

@tefra the generators will resolve the memory consumption, but not the peak afterwards. I think this is the difference between LxmlEventWriter and XmlEventWriter.

@tefra
Owner

tefra commented Oct 21, 2024

Have you pinpointed where the peak happens? Can you share your benchmark script?
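One possible way to locate where the memory goes is a `tracemalloc` snapshot grouped by source line. This is only a sketch, not the benchmark script asked for: `build_everything` is a hypothetical workload standing in for the serializer call, and the snapshot would need to be taken at the suspected peak.

```python
import tracemalloc
from typing import List


def build_everything(count: int) -> List[str]:
    """Hypothetical workload that holds every chunk in memory at once."""
    return [f"<row>{index}</row>" for index in range(count)]


def top_allocation_sites(count: int, limit: int = 3) -> List[tracemalloc.Statistic]:
    """Return the source lines holding the most memory at snapshot time."""
    tracemalloc.start()
    try:
        chunks = build_everything(count)  # keep a reference while snapshotting
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    del chunks
    # statistics("lineno") is sorted from largest to smallest total size.
    return snapshot.statistics("lineno")[:limit]


stats = top_allocation_sites(50_000)
for stat in stats:
    frame = stat.traceback[0]
    print(f"{frame.filename}:{frame.lineno} {stat.size / 1024:.0f} KiB in {stat.count} blocks")
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held inside a C library may not show up and might need an external tool to measure.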
