refactor: stream JLAP repodata writes #891

AaronOpfer · 2024-10-04T00:17:45Z

JLAP previously had up to three copies of repodata simultaneously: the read bytes, the in-memory Serde representation, and the to-be-written bytes. Avoid having multiple copies of repodata in-memory simultaneously by streaming repodata reads and writes, which should bring us down to just one in-memory Serde copy.

Does anybody have tips for recreating a more heavyweight JLAP patching scenario for the purpose of testing this patch's impact? For now, I am going on green test-cases and good vibes...

JLAP previously had up to three copies of repodata simultaneously: the read bytes, the in-memory Serde representation, and the to-be-written bytes. Avoid having multiple copies of repodata in-memory simultaneously by streaming repodata reads and writes, which should bring us down to just one in-memory Serde copy.

wolfv · 2024-10-04T07:44:47Z

I have a tip for recreating a more heavyweight scenario! :) We have quite a lot of old repodata on GHCR. https://github.com/orgs/channel-mirrors/packages/container/package/conda-forge%2Flinux-64%2Frepodata.json is where it is with a date tag.

You can pull the repodata using oras (pixi global install oras and then oras pull ghcr.io/channel-mirrors/conda-forge/linux-64/repodata.json:2024.09.25.18.45). That should give you an old conda-forge repodata snapshot.

crates/rattler_repodata_gateway/src/fetch/jlap/mod.rs

wolfv · 2024-10-06T05:57:00Z

Awesome stuff! Were you able to check wether this improves performance or memory consumption?

crates/rattler_repodata_gateway/Cargo.toml

crates/rattler_repodata_gateway/src/fetch/jlap/mod.rs

AaronOpfer · 2024-10-06T13:23:01Z

Awesome stuff! Were you able to check wether this improves performance or memory consumption?

Still haven't actually tested this code besides cargo test, but from reading relevant issue serde-rs/json#635 , it sounds like serde_json::Value is probably the true culprit of the high memory consumption. In this issue, commenters observed peak memory consumption 10x higher than the size of the JSON in question, and one commenter created ijson crate as a direct answer. We probably can't just drop this in since the JSON Patcher library would need to be changed to accommodate this type.

I suspect we may be consuming extra memory because of parallel patching of noarch and linux-64 repodata in different threads. Memory could be reduced by half if we found a way to force this to be serial on a low-memory system, but that requires defining a "low-memory system" somewhere, and finding a way to snake that requirement through all the layers that need to know about it.

I can proceed with my patches, but I don't believe they will be adequate for solving my personal memory issue. Purely from a selfish perspective, the best value way to deal with this problem for me is probably to just disable JLAP.

AaronOpfer · 2024-10-06T20:22:45Z

I've decided to revert memmap changes and buffered reader changes and swap to std::mem::drop, since it is the cheapest to implement for the time I have left to spend on this particular issue.

I doubt that my changes will make enough of a difference to unlock use of JLAP on a memory-constrained environment, I think it would be better to just wait for sharded repodata to make JLAP irrelevant than invest more time here.

baszalmstra

That makes sense. Thanks for doing this deep dive though!

I have one small suggestion but otherwise looks good to me!

crates/rattler_repodata_gateway/Cargo.toml

AaronOpfer force-pushed the jlapmem branch from a914ddb to fd647d4 Compare October 4, 2024 01:54

AaronOpfer changed the title ~~stream JLAP repodata reads and writes~~ refactor: stream JLAP repodata reads and writes Oct 4, 2024

put back fs_err

0d7b4ad

baszalmstra requested changes Oct 4, 2024

View reviewed changes

crates/rattler_repodata_gateway/src/fetch/jlap/mod.rs Outdated Show resolved Hide resolved

crates/rattler_repodata_gateway/src/fetch/jlap/mod.rs Outdated Show resolved Hide resolved

switch to memmap; apply PR comments

19ce3cb

baszalmstra reviewed Oct 6, 2024

View reviewed changes

crates/rattler_repodata_gateway/Cargo.toml Outdated Show resolved Hide resolved

crates/rattler_repodata_gateway/src/fetch/jlap/mod.rs Outdated Show resolved Hide resolved

revert memmap changes and add std::mem::drop

c4a5d9a

baszalmstra reviewed Oct 7, 2024

View reviewed changes

crates/rattler_repodata_gateway/Cargo.toml Outdated Show resolved Hide resolved

Update crates/rattler_repodata_gateway/Cargo.toml

cdc1239

baszalmstra approved these changes Oct 7, 2024

View reviewed changes

baszalmstra changed the title ~~refactor: stream JLAP repodata reads and writes~~ refactor: stream JLAP repodata writes Oct 7, 2024

baszalmstra enabled auto-merge (squash) October 7, 2024 10:15

baszalmstra merged commit d1155f6 into conda:main Oct 7, 2024
16 checks passed

This was referenced Oct 7, 2024

chore: release #890

Merged

chore: release #899

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: stream JLAP repodata writes #891

refactor: stream JLAP repodata writes #891

AaronOpfer commented Oct 4, 2024

wolfv commented Oct 4, 2024

wolfv commented Oct 6, 2024

AaronOpfer commented Oct 6, 2024

AaronOpfer commented Oct 6, 2024

baszalmstra left a comment

refactor: stream JLAP repodata writes #891

refactor: stream JLAP repodata writes #891

Conversation

AaronOpfer commented Oct 4, 2024

wolfv commented Oct 4, 2024

wolfv commented Oct 6, 2024

AaronOpfer commented Oct 6, 2024

AaronOpfer commented Oct 6, 2024

baszalmstra left a comment

Choose a reason for hiding this comment