feat: arrow based table state and checkpoint handling #1837

Closed
wants to merge 28 commits into from

Conversation

roeap
Collaborator

@roeap roeap commented Nov 11, 2023

this is based on #1829 which is based on #1807

Description

Allows for reading and writing checkpoints more in line with current spec and updates the internal representation of the table state to use Arrow record batches. This is mainly in pursuit of trying to be more aligned with the ongoing work in kernel as well as moving our protocol support forward.

Still very much a work in progress, but opening it for feedback and discussion.

As a follow up I am planning on moving some of the updated code back to kernel.

Related Issue(s)

supersedes: #454
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #1598 (maybe we can even close this)
related: #435

we may also want to consider closing: #125

Documentation

In order to retain the parquet2 functionality and ease the transition, I defined a Snapshot trait that looks similar to a kernel snapshot. The new arrow backed state as well as the existing DeltaTableState implement this trait.

/// A snapshot of a Delta table at a given version.
pub trait Snapshot: std::fmt::Display + Send + Sync + std::fmt::Debug + 'static {
    /// The version of the table at this [`Snapshot`].
    fn version(&self) -> i64;

    /// Table [`Schema`](crate::kernel::schema::StructType) at this [`Snapshot`]'s version.
    fn schema(&self) -> Option<&StructType>;

    /// Table [`Metadata`] at this [`Snapshot`]'s version.
    fn metadata(&self) -> DeltaResult<Metadata>;

    /// Table [`Protocol`] at this [`Snapshot`]'s version.
    fn protocol(&self) -> DeltaResult<Protocol>;

    /// Iterator over the [`Add`] actions at this [`Snapshot`]'s version.
    fn files(&self) -> DeltaResult<Box<dyn Iterator<Item = Add> + '_>>;

    /// Well known table [configuration](crate::table::config::TableConfig).
    fn table_config(&self) -> TableConfig<'_>;
}
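
As a rough illustration of why a shared trait eases the transition, here is a minimal sketch with two implementors and one helper that works against either. Everything in it is a stand-in: the trait is cut down to two methods, `Add` carries only two fields, and `LegacyState`, `ColumnarState`, and `total_size` are hypothetical names that do not come from this PR.

```rust
/// Stand-in for the `Add` action; only two fields are modelled (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Add {
    pub path: String,
    pub size: i64,
}

/// Cut-down version of the `Snapshot` trait (error handling and other accessors omitted).
pub trait Snapshot {
    fn version(&self) -> i64;
    fn files(&self) -> Box<dyn Iterator<Item = Add> + '_>;
}

/// Hypothetical struct-based state that stores actions row by row.
pub struct LegacyState {
    pub version: i64,
    pub adds: Vec<Add>,
}

/// Hypothetical columnar state; parallel vectors stand in for Arrow arrays.
pub struct ColumnarState {
    pub version: i64,
    pub paths: Vec<String>,
    pub sizes: Vec<i64>,
}

impl Snapshot for LegacyState {
    fn version(&self) -> i64 {
        self.version
    }
    fn files(&self) -> Box<dyn Iterator<Item = Add> + '_> {
        Box::new(self.adds.iter().cloned())
    }
}

impl Snapshot for ColumnarState {
    fn version(&self) -> i64 {
        self.version
    }
    fn files(&self) -> Box<dyn Iterator<Item = Add> + '_> {
        // Rebuild row-shaped actions from the columns on demand.
        Box::new(self.paths.iter().zip(&self.sizes).map(|(path, size)| Add {
            path: path.clone(),
            size: *size,
        }))
    }
}

/// Works with any implementor, so callers need not know which state backs the table.
pub fn total_size(snapshot: &dyn Snapshot) -> i64 {
    snapshot.files().map(|add| add.size).sum()
}
```

Code written against the trait, like `total_size` here, keeps working when the backing state is swapped.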

log replay / v2 checkpoints

  • handle compaction actions
  • parse checkpoints with v2 metadata (CheckpointMetadata and Sidecar actions)
  • add a new log path abstraction to set us on a path for supporting urls in delta log.
pub enum LogPath {
    ObjectStore(Path),
    Url(Url),
}

Usage is still fairly contained to the inner parts of delta-rs. The main usage right now is providing helpers for the various kinds of files in the log, and using URLs will mostly just error. Nonetheless, I felt it advantageous to start modelling this.
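
To make the idea of "helpers for the various kinds of files in the log" concrete, here is a minimal sketch. It assumes plain `String`s in place of `object_store::path::Path` and `url::Url` so that it has no dependencies, and the method names `file_name`, `commit_version`, and `is_checkpoint` are hypothetical rather than taken from the PR. It only recognises the classic file names (20-digit commit files and single-file parquet checkpoints).

```rust
/// Stand-in for `LogPath`; `String` replaces `Path` and `Url` (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum LogPath {
    ObjectStore(String),
    Url(String),
}

impl LogPath {
    /// Last path segment, e.g. `00000000000000000010.json`.
    pub fn file_name(&self) -> Option<&str> {
        let raw = match self {
            LogPath::ObjectStore(p) | LogPath::Url(p) => p.as_str(),
        };
        raw.rsplit('/').next().filter(|s| !s.is_empty())
    }

    /// Version of a commit file (`<20 digits>.json`), if this is one.
    pub fn commit_version(&self) -> Option<i64> {
        let stem = self.file_name()?.strip_suffix(".json")?;
        if stem.len() == 20 && stem.bytes().all(|b| b.is_ascii_digit()) {
            stem.parse().ok()
        } else {
            None
        }
    }

    /// True for single-file checkpoints (`<version>.checkpoint.parquet`).
    pub fn is_checkpoint(&self) -> bool {
        self.file_name()
            .map(|name| name.ends_with(".checkpoint.parquet"))
            .unwrap_or(false)
    }
}
```

Because both variants expose the same helpers, callers can classify log files without caring whether the location came from an object store path or a URL.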

timestamps (moved to follow-up PR)

Not surprisingly, changes to timestamp handling propagate quite a bit, so it is better to do this in a follow-up. Keeping the notes for reference.

The Delta protocol states that timestamp types should be written to parquet with the isAdjustedToUtc flag set to true. delta-rs does not currently honour this. For now I opted for:

  • converting delta timestamps to arrow timestamps with tz="UTC"
  • allowing arrow timestamps with tz = "UTC" to be converted to delta timestamps (previously only timestamps with tz = None were parsed)
  • maintaining the behaviour of converting arrow timestamps with tz = None to delta timestamps until we implement timestampNtz.
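
The three rules above can be sketched as a pair of conversion functions. This is only an illustration under stated assumptions: `ArrowTimestamp` and `DeltaType` are dependency-free stand-ins for the real Arrow and Delta type enums, only the timezone is modelled, and rejecting every timezone other than "UTC" is an assumption of this sketch rather than something the notes specify.

```rust
/// Stand-in for an Arrow timestamp type; only the timezone is modelled (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct ArrowTimestamp {
    pub tz: Option<String>,
}

/// Stand-in for the Delta primitive type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeltaType {
    Timestamp,
}

/// Delta -> Arrow: always tag the Arrow type with tz = "UTC".
pub fn delta_to_arrow(ty: DeltaType) -> ArrowTimestamp {
    match ty {
        DeltaType::Timestamp => ArrowTimestamp {
            tz: Some("UTC".to_string()),
        },
    }
}

/// Arrow -> Delta: accept tz = "UTC" and, for now, tz = None.
pub fn arrow_to_delta(ts: &ArrowTimestamp) -> Result<DeltaType, String> {
    match ts.tz.as_deref() {
        Some("UTC") => Ok(DeltaType::Timestamp),
        // Kept for backwards compatibility until timestampNtz is implemented.
        None => Ok(DeltaType::Timestamp),
        // Assumption: any other timezone is rejected in this sketch.
        Some(other) => Err(format!("unsupported timezone: {}", other)),
    }
}
```

With these rules a Delta timestamp round-trips through Arrow unchanged, while data that already lacks a timezone keeps loading as before.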

@github-actions github-actions bot added binding/rust Issues for the Rust crate crate/core crate/sql binding/python Issues for the Python package labels Nov 11, 2023
@github-actions github-actions bot removed the binding/python Issues for the Python package label Nov 11, 2023
@houqp
Member

houqp commented Nov 20, 2023

Really exciting work, glad to see this finally happen! Any early memory consumption benchmark results to share?

rtyler added a commit that referenced this pull request Jan 23, 2024
# Description

This is still very much a work in progress, opening it up for visibility
and discussion.

Finally, I do hope that we can make the switch to arrow based log
handling. Aside from the hoped-for reduction in memory footprint, I
also believe it opens us up to many future optimizations.

To make the transition we introduce two new structs:

- `Snapshot` - a half lazy version of the Snapshot, which only tries to
get `Protocol` & `Metadata` actions ASAP. Of course these drive all our
planning activities and without them there is not much we can do.
- `EagerSnapshot` - An intermediary structure, which eagerly loads file
actions and does log replay to serve as a compatibility layer for the
current `DeltaTable` APIs.
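
A minimal sketch of the lazy / eager split, assuming a heavily simplified model: commits are in-memory vectors rather than log files, `Action` carries only a path, and the field names are hypothetical. The replay shown is the usual newest-to-oldest walk in which the first action seen for a path wins; it is not claimed to match the actual implementation.

```rust
use std::collections::HashSet;

/// Simplified file action; real actions carry far more fields (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Add(String),
    Remove(String),
}

/// Lazy: knows the version and the log, but has not resolved file actions.
pub struct Snapshot {
    pub version: i64,
    /// Commits ordered oldest to newest; stand-in for log files on storage.
    pub commits: Vec<Vec<Action>>,
}

/// Eager: file actions are replayed once at construction.
pub struct EagerSnapshot {
    pub snapshot: Snapshot,
    pub files: Vec<String>,
}

impl EagerSnapshot {
    pub fn new(snapshot: Snapshot) -> Self {
        let mut seen = HashSet::new();
        let mut files = Vec::new();
        // Walk newest to oldest; the first action seen for a path wins.
        for commit in snapshot.commits.iter().rev() {
            for action in commit {
                match action {
                    Action::Add(path) => {
                        if seen.insert(path.clone()) {
                            files.push(path.clone());
                        }
                    }
                    Action::Remove(path) => {
                        // A newer remove hides any older add of the same path.
                        seen.insert(path.clone());
                    }
                }
            }
        }
        Self { snapshot, files }
    }
}
```

The lazy struct stays cheap to create, and only callers that need the file list pay for the replay.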

One conceptually larger change is related to how we view the
availability of information. Up until now `DeltaTableState` could be
initialized empty, containing no useful information for any code to work
with. State (snapshots) now always needs to be created valid. The thing
that may not yet be initialized is the `DeltaTable`, which now only
carries the table configuration and the `LogStore`. The state / snapshot
is now optional. Consequently, all code that works against a snapshot no
longer needs to handle the case that metadata / schema etc. may not be
available.
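
The shift described above can be sketched as follows, under loud assumptions: `Snapshot` and `DeltaTable` here are toy stand-ins, the schema is a plain `String`, errors are `String`s, and the names `try_new`, `load`, and `snapshot` are illustrative only. The point is where the `Option` lives: on the table, not inside the state.

```rust
/// Stand-in for the always-valid table state (fields simplified, assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    version: i64,
    schema: String,
}

impl Snapshot {
    /// Construction fails instead of yielding an empty state.
    pub fn try_new(version: i64, schema: Option<String>) -> Result<Self, String> {
        let schema = schema.ok_or_else(|| "no metadata action found in log".to_string())?;
        Ok(Self { version, schema })
    }

    pub fn version(&self) -> i64 {
        self.version
    }

    /// No `Option` here: a snapshot that exists always has a schema.
    pub fn schema(&self) -> &str {
        &self.schema
    }
}

/// The table may be unloaded; only then is the state absent.
#[derive(Default)]
pub struct DeltaTable {
    state: Option<Snapshot>,
}

impl DeltaTable {
    pub fn load(&mut self, version: i64, schema: Option<String>) -> Result<(), String> {
        self.state = Some(Snapshot::try_new(version, schema)?);
        Ok(())
    }

    pub fn snapshot(&self) -> Result<&Snapshot, String> {
        self.state
            .as_ref()
            .ok_or_else(|| "table not loaded".to_string())
    }
}
```

Callers check once, at `snapshot()`, whether the table is loaded; everything downstream can rely on metadata and schema being present.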

This also has implications for the datafusion integration. We already
are working against snapshots mostly, but should abolish most traits
implemented for `DeltaTable` as this does not provide the information
(and never has) that is at least required to execute a query.

Some larger notable changes include:

* remove `DeltaTableMetadata` and always use `Metadata` action.
* arrow and parquet are now required, as such the features got removed.
Personally I would also argue that if you cannot read checkpoints, you
cannot read delta tables :), so hopefully users weren't using
arrow-free versions.

### Major follow-ups:

* (pre-0.17) review integration with `log_store` and `object_store`.
Currently we make use mostly of `ObjectStore` inside the state handling.
What we really use is `head` / `list_from` / `get` - my hope would be
that we end up with a single abstraction...
* test cleanup - we are currently dealing with test flakiness and have
several approaches to scaffolding tests. Since we have the
`deltalake-test` crate now, this can be reconciled.
* ...
* do more processing on borrowed data ...
* perform file-heavy operations on arrow data
* update checkpoint writing to leverage new state handling and arrow ...
* switch to exposing URL in public APIs

## Questions

* should paths be percent-encoded when written to checkpoint?

# Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
@roeap
Collaborator Author

roeap commented Jan 26, 2024

superseded by #2037

@roeap roeap closed this Jan 26, 2024
@roeap roeap deleted the snapshots branch January 26, 2024 05:24
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this pull request Feb 2, 2024
2 participants