Collect iterator into chunks early #433

RReverser · 2023-10-16T15:44:04Z

Description of Changes

Initially we used iterator because we wanted to lazily pull values from the database when Wasm requests the next item, so that e.g. early stop works as expected and doesn't read too much data.

Unfortunately, we discovered that this leads to crashes when someone tries to iterate over the database and update/delete items inside the iteration since iteration itself held a lock. We added a workaround by collecting the iterator early into a Vec, and then returning a newly created iterator to the user. This fixed the issue, but at this point lazy iterator mechanism turned from an optimisation into pure overhead - if we want to collect Vec, we might as well do this early in the pipeline and avoid an extra dependency.

This slightly improves iteration of empty tables as it only reduces allocation overhead per iterator, but otherwise the difference gets drowned by the speed of the iterator itself, so this is mainly code simplification.

I still hope we can return to this and implement proper end-to-end lazy iteration in the future. This should be possible, since we already have branching so any updates inside iteration shouldn't affect iteration itself and both could happen without holding a global lock, but in the past this has proven difficult.

Note: this PR will conflict with #420. I thought I'd wait until that one is merged, but looks like there are some reservations about the ABI-breaking change for now, so submitting them in parallel. Either way shouldn't be a difficult conflict resolution.

API and ABI

This is a breaking change to the module ABI
This is a breaking change to the module API
This is a breaking change to the ClientAPI
This is a breaking change to the SDK API

If the API is breaking, please state below what will break

Expected complexity level and risk

How complicated do you think these changes are? Grade on a scale from 1 to 5,
where 1 is a trivial change, and 5 is a deep-reaching and complex change.

This complexity rating applies not only to the complexity apparent in the diff,
but also to its interactions with existing and future code.

If you answered more than a 2, explain what is complex about the PR,
and what other components it interacts with in potentially concerning ways.

Initially we used iterator because we wanted to lazily pull values from the database when Wasm requests the next item, so that e.g. early stop works as expected and doesn't read too much data. Unfortunately, we discovered that this leads to crashes when someone tries to iterate over the database and update/delete items inside the iteration since iteration itself held a lock. We added a workaround by collecting the iterator early into a Vec, and then returning a newly created iterator to the user. This fixed the issue, but at this point lazy iterator mechanism turned from an optimisation into pure overhead - if we want to collect Vec, we might as well do this early in the pipeline and avoid an extra dependency. This slightly improves iteration of empty tables, but otherwise the difference gets drowned by the speed of the iterator itself, so this is mainly code simplification. I still hope we can return to this and implement proper end-to-end lazy iteration in the future. This should be possible, since we already have branching so any updates inside iteration shouldn't affect iteration itself and both could happen without holding a global lock, but in the past this has proven difficult.

kulakowski · 2023-10-16T16:52:15Z

Seems like an interesting bit of use case for the mem-arch stuff to keep in mind.

RReverser · 2023-10-16T17:17:17Z

Seems like an interesting bit of use case for the mem-arch stuff to keep in mind.

Sorry, not sure I understand - which part are you referring to that would be related to mem-arch?

kulakowski · 2023-10-16T17:21:43Z

Seems like an interesting bit of use case for the mem-arch stuff to keep in mind.

Sorry, not sure I understand - which part are you referring to that would be related to mem-arch?

The goal of that effort is to redesign our memory architecture and think about when and where we copy. This is a somewhat performance sensitive place where we make tradeoffs around ease of implementation and memory used. Supporting the "proper end-to-end lazy iteration" (or deciding not to!) seems like a good thing for the memory architecture to have in mind.

RReverser · 2023-10-16T17:32:10Z

Ah yeah that makes sense.

That said, the issue there was not so much with memory architecture, as with... idk, locking / forking architecture?

My idea back when this issue first arose was that we should leverage our existing ability to branch out on edits, so that iteration can keep the existing branch alive and prevent it from getting flattened, while any updates that happen inside iteration are added to a new branch.

However, there were some concerns and questions about whether updates inside iteration should be visible to iteration (personally I think the answer is no, and that's what we implement now with buffering anyway) and how to avoid taking the lock, so we never ended up implementing that.

crates/core/src/host/instance_env.rs

Not sure why openssl-src disappeared from Cargo.lock before, but here we go.

RReverser requested a review from cloutiertyler October 16, 2023 17:18

RReverser mentioned this pull request Oct 16, 2023

Pack bsatns into buffers for iter_filtered #254

Closed

RReverser requested a review from kulakowski October 16, 2023 17:36

kulakowski approved these changes Oct 16, 2023

View reviewed changes

crates/core/src/host/instance_env.rs Show resolved Hide resolved

Add doc comments

a6e2e0c

RReverser enabled auto-merge (squash) October 16, 2023 18:04

Commit new Cargo.lock

d1689cb

Not sure why openssl-src disappeared from Cargo.lock before, but here we go.

RReverser merged commit 6715b4c into master Oct 16, 2023
5 checks passed

RReverser deleted the iter-chunked-writer branch October 16, 2023 18:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collect iterator into chunks early #433

Collect iterator into chunks early #433

RReverser commented Oct 16, 2023

kulakowski commented Oct 16, 2023

RReverser commented Oct 16, 2023 •

edited

Loading

kulakowski commented Oct 16, 2023

RReverser commented Oct 16, 2023

Collect iterator into chunks early #433

Collect iterator into chunks early #433

Conversation

RReverser commented Oct 16, 2023

Description of Changes

API and ABI

Expected complexity level and risk

kulakowski commented Oct 16, 2023

RReverser commented Oct 16, 2023 • edited Loading

kulakowski commented Oct 16, 2023

RReverser commented Oct 16, 2023

RReverser commented Oct 16, 2023 •

edited

Loading