Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: stream JLAP repodata writes #891

Merged
merged 5 commits into from
Oct 7, 2024
Merged

Conversation

AaronOpfer
Copy link
Contributor

JLAP previously had up to three copies of repodata simultaneously: the read bytes, the in-memory Serde representation, and the to-be-written bytes. Avoid having multiple copies of repodata in-memory simultaneously by streaming repodata reads and writes, which should bring us down to just one in-memory Serde copy.

Does anybody have tips for recreating a more heavyweight JLAP patching scenario for the purpose of testing this patch's impact? For now, I am going on green test-cases and good vibes...

JLAP previously had up to three copies of repodata simultaneously: the
read bytes, the in-memory Serde representation, and the to-be-written
bytes. Avoid having multiple copies of repodata in-memory simultaneously
by streaming repodata reads and writes, which should bring us down to
just one in-memory Serde copy.
@AaronOpfer AaronOpfer changed the title stream JLAP repodata reads and writes refactor: stream JLAP repodata reads and writes Oct 4, 2024
@wolfv
Copy link
Contributor

wolfv commented Oct 4, 2024

I have a tip for recreating a more heavyweight scenario! :) We have quite a lot of old repodata on GHCR. https://github.com/orgs/channel-mirrors/packages/container/package/conda-forge%2Flinux-64%2Frepodata.json is where it is with a date tag.

You can pull the repodata using oras (pixi global install oras and then oras pull ghcr.io/channel-mirrors/conda-forge/linux-64/repodata.json:2024.09.25.18.45). That should give you an old conda-forge repodata snapshot.

@wolfv
Copy link
Contributor

wolfv commented Oct 6, 2024

Awesome stuff! Were you able to check wether this improves performance or memory consumption?

@AaronOpfer
Copy link
Contributor Author

Awesome stuff! Were you able to check wether this improves performance or memory consumption?

Still haven't actually tested this code besides cargo test, but from reading relevant issue serde-rs/json#635 , it sounds like serde_json::Value is probably the true culprit of the high memory consumption. In this issue, commenters observed peak memory consumption 10x higher than the size of the JSON in question, and one commenter created ijson crate as a direct answer. We probably can't just drop this in since the JSON Patcher library would need to be changed to accommodate this type.

I suspect we may be consuming extra memory because of parallel patching of noarch and linux-64 repodata in different threads. Memory could be reduced by half if we found a way to force this to be serial on a low-memory system, but that requires defining a "low-memory system" somewhere, and finding a way to snake that requirement through all the layers that need to know about it.

I can proceed with my patches, but I don't believe they will be adequate for solving my personal memory issue. Purely from a selfish perspective, the best value way to deal with this problem for me is probably to just disable JLAP.

@AaronOpfer
Copy link
Contributor Author

I've decided to revert memmap changes and buffered reader changes and swap to std::mem::drop, since it is the cheapest to implement for the time I have left to spend on this particular issue.

I doubt that my changes will make enough of a difference to unlock use of JLAP on a memory-constrained environment, I think it would be better to just wait for sharded repodata to make JLAP irrelevant than invest more time here.

Copy link
Collaborator

@baszalmstra baszalmstra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Thanks for doing this deep dive though!

I have one small suggestion but otherwise looks good to me!

crates/rattler_repodata_gateway/Cargo.toml Outdated Show resolved Hide resolved
@baszalmstra baszalmstra changed the title refactor: stream JLAP repodata reads and writes refactor: stream JLAP repodata writes Oct 7, 2024
@baszalmstra baszalmstra enabled auto-merge (squash) October 7, 2024 10:15
@baszalmstra baszalmstra merged commit d1155f6 into conda:main Oct 7, 2024
16 checks passed
This was referenced Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants