Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

commitlog: Support traversal without opening the log #1103

Merged
merged 1 commit into from
Apr 19, 2024

Conversation

kim
Copy link
Contributor

@kim kim commented Apr 17, 2024

Traversing the commitlog without also making it available for writing would still require upfront I/O imposed by the open constructor.

Avoid that by introducing free-standing functions which start traversal right away.

@cloutiertyler
Copy link
Contributor

I am not sure I understand what this does. Does this defer the cost of open until the first .next call? I couldn't really work it out from the code at a glance.

@cloutiertyler
Copy link
Contributor

Oh I see, it's just not opening the last segment for writing?

@kim
Copy link
Contributor Author

kim commented Apr 18, 2024

Right -- we usually want to open the log for writing, and later get an iterator.

Not covered in this patch is replaying, and then get a writable commitlog without traversing the last segment twice.

Traversing the commitlog without also making it available for writing
would still require upfront I/O imposed by the `open` constructor.

Avoid that by introducing free-standing functions which start traversal
right away.
@kim kim force-pushed the kim/commitlog2/iter-ro branch from 3c5a9e8 to 161582d Compare April 18, 2024 10:18
@kim
Copy link
Contributor Author

kim commented Apr 18, 2024

The latter I would actually not try, because bootstrapping becomes quite convoluted (replay db without durability -> obtain valid commitlog writer -> set as db durability). Instead, we can make this faster using an offset index.

@lcodes Do you need commitlog::Generic to be public?

offset: u64,
) -> io::Result<impl Iterator<Item = Result<Commit, error::Traversal>>> {
commitlog::commits_from(repo::Fs::new(root), DEFAULT_LOG_FORMAT_VERSION, offset)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the only difference between these and the commitlog:: variants the abstraction over the repo?

Would it be better to just have the ones in commitlog for this and let the caller call repo::Fs::new manually?

(In my case, a custom TarRepo is used to consume the logs without extracting them from the *.tar archive first)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of what you’re using isn’t exported from the crate. We can change that, but I’ll probably be molested to add docstrings everywhere 😿

})
}

pub fn transactions_from<'a, R, D, T>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth having these transactions_from variants here? Maybe I'm not aware of existing code to persist the tables schema state, or how to skip entire tx ranges using these functions, but it seems that unless the caller can persist their decoder state, they always have to replay from 0. In the case of code under dev, where a full replay is needed on code changes, going through the full snapshot on every iteration is too slow, and --release isn't friendly inside the debugger.

For analytics, the custom decoder calls decode_record_fn and only replays records for the system tables to create the schemas needed to decode rows, and processes everything else on the fly from within the Visitor.

For information, I'm only using commits. This is because, given the large snapshot at the beginning of the log, it's starting at offset 0 to load the schema information, then skips to offset 163 to load the relevant table's snapshot, and then skips all the way to the first post-snapshot commit. This saves most of the startup time without having to persist the state of the schemas, ie many millions of records don't have to be decoded at all before getting to the interesting data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nevermind most of that, just realized Decoder.decode_record is meant to call decode_record_fn, and the tx could be filtered there instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_from is useful for forensics, and for replication.

@kim kim added this pull request to the merge queue Apr 19, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 19, 2024
@kim kim added this pull request to the merge queue Apr 19, 2024
Merged via the queue into master with commit 06d5481 Apr 19, 2024
6 checks passed
@kim kim deleted the kim/commitlog2/iter-ro branch April 20, 2024 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants