
Implement delta log checkpointing #106

Closed
xianwill opened this issue Mar 3, 2021 · 0 comments · Fixed by #280
Assignees
Labels
binding/rust (Issues for the Rust crate), enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@xianwill
Collaborator

xianwill commented Mar 3, 2021

JSON delta log entries should be consolidated into a parquet snapshot periodically so that log readers can deserialize the log quickly. The Spark Delta reference implementation consolidates JSON log entries into parquet checkpoint snapshots after every 10th JSON-formatted log entry; this behavior is also described under the Checkpoints section of the Delta protocol. delta-rs should offer a similar utility. An MVP should minimally provide a parquet checkpoint utility function so that callers can schedule checkpointing appropriately.
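For illustration, a minimal sketch of what invoking such a utility looks like through today's Python bindings, which expose it as `DeltaTable.create_checkpoint()`; the table path is hypothetical:

```python
from deltalake import DeltaTable

# Assumes a Delta table already exists at ./my_table.
dt = DeltaTable("./my_table")

# Write a parquet checkpoint for the current version under _delta_log/,
# so log readers can skip replaying every JSON commit.
dt.create_checkpoint()
```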

@houqp houqp added the binding/rust, enhancement, and help wanted labels Mar 3, 2021
@xianwill xianwill self-assigned this May 24, 2021
bbigras added a commit to bbigras/delta-rs that referenced this issue Sep 30, 2021
wjones127 added a commit that referenced this issue Sep 30, 2022
# Description

This is the published version of the google doc
https://docs.google.com/document/d/1iKVdnilMCS6qziBBVohXw2-UX_Ld2K6AAjLTZHcYhQ0/edit#heading=h.iezbyetoxtf

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
fvaleye pushed a commit that referenced this issue Nov 16, 2022
# Description
Update datafusion and arrow dependencies to the latest versions.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
wjones127 pushed a commit that referenced this issue Nov 17, 2022

# Description
When passing `parquet_read_options` to `to_pyarrow_dataset` it is now
possible to use `dictionary_columns` to control which columns should be
dictionary encoded as they are read.
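A minimal sketch of that call, assuming a hypothetical table at `./my_table` with a `country` column:

```python
import pyarrow.dataset as ds
from deltalake import DeltaTable

dt = DeltaTable("./my_table")  # hypothetical table path

# Dictionary-encode the `country` column as it is read.
dataset = dt.to_pyarrow_dataset(
    parquet_read_options=ds.ParquetReadOptions(dictionary_columns=["country"])
)
```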

# Related Issue(s)
- closes #938 
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
roeap added a commit that referenced this issue Nov 17, 2022
# Description

This PR builds on top of the changes to runtime handling in #933. In
my local tests this fixed #915. Additionally, I added the runtime as a
property on the fs handler to avoid re-creating it on every call. In
some non-representative tests with a large number of very small
partitions it cut the runtime in about half.

cc @wjones127 

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
roeap added a commit that referenced this issue Nov 22, 2022
# Description

Adding a simple example of how to use the operations APIs to create and work
with delta tables, plus some minor documentation tweaks.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->

Co-authored-by: Will Jones <willjones127@gmail.com>
roeap added a commit that referenced this issue Dec 3, 2022
# Description

This PR integrates the `DeltaDataChecker` into the write path of the
operations API. I was unsure whether to integrate this into the
(experimental) writer in the operations module, but opted to keep
the writer itself focused on the lower-level write operations.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
wjones127 pushed a commit that referenced this issue Dec 9, 2022
# Description

This loosens the version requirement for maturin when building the python
package.

# Related Issue(s)

- closes #1004
<!---
For example:

- closes #106
--->

# Documentation

Using an exact version can be excessively restrictive, for example
preventing the use of the latest (but still compatible) version of
maturin. Especially when packaging for rolling-releases distributions
(like Archlinux), being able to build with the latest version would be
beneficial.
<!---
Share links to useful documentation
--->
wjones127 pushed a commit that referenced this issue Dec 19, 2022
# Description
The description of the main changes of your pull request

# Related Issue(s)
<!---
For example:

- closes #106
--->
Closes #971 
# Documentation

<!---
Share links to useful documentation
--->
wjones127 pushed a commit that referenced this issue Dec 19, 2022
# Description
The description of the main changes of your pull request

Use [labeler](https://github.com/actions/labeler) to automatically label
PRs. There are language labels, CI, documentation as well as
crate-specific labels.
# Related Issue(s)
<!---
For example:

- closes #106
--->
Closes #1026 & unblocks #997.
# Documentation

<!---
Share links to useful documentation
--->
wjones127 added a commit that referenced this issue Dec 21, 2022
# Description
Fix lint issues from latest cargo release.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
wjones127 added a commit that referenced this issue Dec 21, 2022
…979)

# Description

This is intended to fix the issue where the Python CI job is frequently
failing. I believe it's because the Docker services haven't fully
started up before the tests start running. To address this, I added a
function to wait for the services to be responsive.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
fvaleye pushed a commit that referenced this issue Dec 27, 2022
# Description

This PR consolidates the four methods `files()`, `file_paths()`,
`file_uris()`, `files_by_partitions()` into just two methods:

* `files()` -> which returns paths as they are in the Delta Log (usually
relative, but *can* be absolute, particularly if they are located
outside of the delta table root).
 * `file_uris()`, which returns absolute URIs for all files.

Both of these now take the `partition_filters` parameter, making
`files_by_partitions()` obsolete. The latter function has been marked
deprecated, but it has also been restored to its original behavior of
returning absolute file paths rather than relative ones, resolving #894.

Finally, the `partition_filters` parameter now supports passing values
other than strings, such as integers and floats.
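A short sketch of the consolidated API with a hypothetical table and partition column; note the non-string filter value:

```python
from deltalake import DeltaTable

dt = DeltaTable("./my_table")  # hypothetical table path

# Paths as recorded in the Delta log (usually relative to the table root).
log_paths = dt.files([("year", "=", 2021)])

# Absolute URIs for the same files.
uris = dt.file_uris([("year", "=", 2021)])
```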

TODO:

 * [x] Update documentation
* [ ] ~~Test behavior of filtering for null or non-null~~ Null handling
isn't supported by DNF filters IIUC
 * [x] Test behavior of paths on object stores.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
roeap added a commit that referenced this issue Jan 5, 2023
# Description

Moving the `vacuum` operation into the operations module and adopting
`IntoFuture` for the command builder. This is breaking the APIs for the
builder (now with consistent setter names) but we are able to keep the
APIs for `DeltaTable` in rust and python.

In a follow-up I would like to move the optimize command as well. This,
however, may require refactoring the `PartitionValue`, since we can only
deal with `'static` lifetimes when using `IntoFuture`. A while back we
talked about pulling in `ScalarValue` from datafusion to optimize that
implementation, and maybe that's a good opportunity to look into that as
well.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->

Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127 added a commit that referenced this issue Jan 17, 2023
# Description

Considering adding continuous benchmarks to Python reader / writer.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
wjones127 added a commit that referenced this issue Jan 26, 2023
# Description

In preparation for new release.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
roeap added a commit that referenced this issue Jan 27, 2023
# Description

The latest Rust release comes with a new, more opinionated clippy :). This
PR fixes the new clippy errors and runs `cargo clippy --fix` on all
our crates.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
wjones127 pushed a commit that referenced this issue Jan 28, 2023
Signed-off-by: Marijn Valk <marijncv@hotmail.com>

# Description
Adds a test for the `left_larger_than_right` function and rewrites the
function's match expression to match on both the `left` and `right`
arguments.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Signed-off-by: Marijn Valk <marijncv@hotmail.com>
roeap added a commit that referenced this issue Feb 1, 2023
# Description

A simple maintenance PR to update datafusion to the latest version. 

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
wjones127 added a commit that referenced this issue Oct 30, 2023
# Description
This PR exposes the FSCK operation as a `repair` method under the
`DeltaTable` class.

# Related Issue(s)
<!---
For example:

- closes #106
--->
- closes #1727

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
roeap added a commit that referenced this issue Oct 31, 2023
# Description

This is a fairly early draft to create logical plans from SQL using the
datafusion abstractions. I adopted the patterns over there quite closely,
since the ultimate goal would be to ask the datafusion community whether
they would accept these changes within the datafusion sql crate ...

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
ryanaston pushed a commit to segmentio/delta-rs that referenced this issue Nov 1, 2023
* feat: extend unity catalog support

* chore: draft datafusion integration

* fix: allow passing catalog options from python

* chore: clippy

* feat: add more azure credentials

* fix: add defaults for return types

* fix: simpler defaults

* Update rust/src/data_catalog/unity/mod.rs

Co-authored-by: nohajc <nohajc@gmail.com>

* fix: imports

* fix: add some defaults

* test: add failing provider test

* feat: list catalogs

* merge main

* fix: remove artifact

* fix: errors after merge with main

* Start python api docs

* docs: update Readme (delta-io#1440)

# Description

With summit coming up I thought we might update our README, since
delta-rs has evolved quite a bit since the README was first written...

Just opening the Draft to get feedback on the general "patterns" i.e.
how the tables are formatted, how detailed we want to show the features
and mostly the looks of the header.

Also hoping our community experts may have some content they want to add
here 😆.

cc @dennyglee @MrPowers @wjones127 @rtyler @houqp @fvaleye

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>

* Pin chrono to 0.4.30

v0.4.31 was just released which introduces some spurious deprecation warnings

* docs: update Readme (delta-io#1633)

# Description
- Changed the icons as, at first glance, it looked like AWS was not
supported (in blue), while the green open icon looked like it was
completed
- Added one line linking to the Delta Lake docker
- Fixed some minor grammar issues

Including community experts @roeap @MrPowers @wjones127 @rtyler @houqp
@fvaleye to ensure these updates make sense. Thanks!

* chore: update datafusion to 31, arrow to 46 and object_store to 0.7 (delta-io#1634)

# Description

Update datafusion to 31

* chore: relax chrono pin to 0.4 (delta-io#1635)

# Description

relax chrono pin to improve downstream compatibility.

* make create_checkpoint_for public

* add documentation to create_checkpoint_for

* Implement parsing for the new `domainMetadata` actions in the commit log

The Delta Lake protocol that will be released in conjunction with "3.0.0"
(currently at RC1) introduces `domainMetadata` actions to the commit log,
enabling system- or user-provided metadata about the commits to be added to
the log. With DBR 13.3 in the Databricks ecosystem, tables are already being
written with this action via the "liquid clustering" feature.

This change enables the clean reading of these tables, but at present nothing
novel is done with this information.

[Read more here](https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering)

Fixes delta-io#1626

Sponsored-by: Databricks Inc

* fix: include in-progress row group when calculating in-memory buffer length (delta-io#1638)

# Description
`PartitionWriter.buffer_len()` is documented as returning: 

> the current byte length of the in memory buffer.

However, this doesn't currently include the length of the in-progress
row group. This means that until a row group is flushed, `buffer_len()`
returns `0`. Based on the documented description, its length should
probably include the bytes currently in-memory as part of an unflushed
row group.

`buffered_record_batch_count` _does_ include in-progress row groups, so
this change also means record count and buffered bytes are reported
consistently.

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->
- closes delta-io#1637

# Documentation

<!---
Share links to useful documentation
--->

[`buffer_len` on
`RecordBatchWriter`](https://docs.rs/deltalake/0.15.0/deltalake/writer/record_batch/struct.RecordBatchWriter.html#method.buffer_len)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* feat: allow multiple incremental commits in optimize

Currently "optimize" executes the whole plan in one commit, which might
fail. The larger the table, the more likely it is to fail and the more
expensive the failure is.

Add an option in OptimizeBuilder that allows specifying a commit
interval. If that is provided, the plan executor will periodically
commit the accumulated actions.

* fix: explicitly require chrono 0.4.31 or greater

The Python binding relies on `timestamp_nanos_opt()`, which requires 0.4.31 or
greater from chrono since it did not previously exist.

As a [cargo dependency
refresher](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies-from-cratesio),
this version range is >=0.4.31, <0.5.0, which I believe is what we need for
optimal downstream compatibility.

* Correct some merge related errors with redundant package names from the workspace

* Address some latent clippy failures after merging main

* Correct the incorrect documentation for `Backoff`

* fix: avoid excess listing of log files

* feat: pass known file sizes to filesystem in Python (delta-io#1630)

# Description
Currently the Filesystem implementation always makes a HEAD request when
opening a file, to determine the file size. The proposed change is to
read the file sizes from the delta log instead, and to pass them down to
the `open_input_file` call, eliminating the HEAD request.

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->

* Proposed updated CODEOWNERS to allow better review notifications

Based on current pull request feedback and maintenance trends I'm suggesting
these rules to get the right people on the reviews by default.

Closes delta-io#1553

* fix: add support for Microsoft OneLake

This change introduces tests and support for Microsoft OneLake. This specific
commit is a rebase of the work done by our pals at Microsoft.

Co-authored-by: Mohammed Muddassir <v-mmuddassir@microsoft.com>
Co-authored-by: Christopher Watford <christopher.watford@kcftech.com>

* Ignore failing integration tests which require a special environment to operate

The OneLake support should be considered unsupported and experimental until such
time when we can add integration testing to our CI process

* Compensate for invalid log files created by Delta Live Tables

It would appear that in some cases Delta Live Tables will create a Delta table
which does not adhere to the Delta Table protocol.

The metaData action has a **required** `schemaString` property which simply
doesn't exist. Since it appears that this only exists at version zero of the
transaction log, and the _actual_ schema exists in the following versions of the
table (e.g. 1), this change introduces a default deserializer on the MetaData
action which provides a simple empty schema.

This is an alternative implementation to delta-io#1305 which is a bit more invasive and
makes our schema_string struct member `Option<String>` which I do not believe is
worth it for this unfortunate compatibility issue

Closes delta-io#1305, delta-io#1302, delta-io#1357

Sponsored-by: Databricks Inc

* chore: fix the incorrect Slack link in our readme

not sure what the deal is with the go.delta.io service; no idea where that lives

Fixes delta-io#1636

* enable offset listing for s3

* Make docs.rs build docs with all features enabled

I was confused that I could not find the documentation for integrating datafusion with delta-rs.

With this PR, everything should show up. Perhaps docs for a feature-gated method should also mention which feature is required, similar to what Tokio does. That could be done in follow-up PRs.

* feat: expose min_commit_interval to `optimize.compact` and `optimize.z_order` (delta-io#1645)

# Description
Exposes min_commit_interval in the Python API to `optimize.compact` and
`optimize.z_order`. Added one test-case to verify the
min_commit_interval.
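A sketch of the new parameter, assuming a hypothetical table path:

```python
from datetime import timedelta

from deltalake import DeltaTable

dt = DeltaTable("./my_table")  # hypothetical table path

# Commit accumulated actions at least every five minutes instead of in a
# single commit at the end of the whole operation.
dt.optimize.compact(min_commit_interval=timedelta(minutes=5))
```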

# Related Issue(s)
closes delta-io#1640

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* docs: add docstring to protocol method (delta-io#1660)

* fix: percent encoding of partition values and paths

* feat: handle path encoding in serde and encode partition values in file names

* fix: always unquote partition values extracted from path

* test: add tests for related issues

* fix: consistent serialization of partition values

* fix: roundtrip special characters

* chore: format

* fix: add feature requirement to load example

* test: add timestamp col to partitioned roundtrip tests

* test: add rust roundtip test for special characters

* fix: encode characters illegal on windows

* docs: fix some typos (delta-io#1662)

# Description
Saw two typos, and marking merge in rust as half-done with a comment on
its current limitation.

* feat: use url parsing from object store

* fix: ensure config for ms fabric

* chore: drive-by simplify test files

* fix: update aws http config key

* fix: feature gate azure update

* feat: more robust azure config handling

* fix: in memory store handling

* feat: use object-store's s3 store if copy-if-not-exists headers are specified (delta-io#1356)

* refactor: re-organize top level modules (delta-io#1434)

# Description

~This contains changes from delta-io#1432, will rebase once that's merged.~

This PR constitutes the bulk of re-organising our top level modules.
- move `DeltaTable*` structs into new `table` module
- move table configuration into `table` module
- move schema related modules into `schema` module
- rename `action` module to `protocol` - hoping to isolate everything
that can one day be the log kernel.

~It also removes the deprecated commit logic from `DeltaTable` and
updates call sites and tests accordingly.~

I am planning one more follow up, where I hope to make `transactions`
currently within `operations` a top level module. While the number of
touched files here is already massive, I want to do this in a follow up,
as it will also include some updates to the transactions itself, that
should be more carefully reviewed.

# Related Issue(s)

closes: delta-io#1136

# Documentation

<!---
Share links to useful documentation
--->

* chore: increment python library version (delta-io#1664)

# Description
The description of the main changes of your pull request

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->

* fix exception string in writer.py

The exception message is ambiguous as it interchanges the table and data schemas.

* Update docs

* add read me

* Add space

* feat: allow to set large dtypes for the schema check in `write_deltalake` (delta-io#1668)

# Description
Currently the schema check always ran against non-large types. I didn't
know before that we could change this, so in polars we added schema
casting from large to non-large types. This became a problem today when I
wanted to write 200M records at once, because the array was too big to
fit in a normal string type.
```python
ArrowInvalid: Failed casting from large_string to string: input array too large
```

Adding this flag will allow libraries like polars to write directly with
their large dtypes in arrow. If this is merged, I can work on a fix in
polars to remove the schema casting for these large types.
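A sketch of the flag in use, assuming the parameter is named `large_dtypes` as exposed by this change:

```python
import pyarrow as pa

from deltalake import write_deltalake

data = pa.table({"x": pa.array(["a", "b"], type=pa.large_string())})

# Keep large_string as-is instead of requiring a cast down to string.
write_deltalake("./my_table", data, large_dtypes=True)
```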

* fix: change partitioning schema from large to normal string for pyarrow<12 (delta-io#1671)

# Description
If pyarrow is below v12.0.0 it changes the partitioning schema fields
from large_string to string.

# Related Issue(s)
closes delta-io#1669 

# Documentation
apache/arrow#34546 (comment)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* chore: bump rust crate version

* fix: use epoch instead of ce for date stats (delta-io#1672)

# Description
date32 statistics logic was subjectively wrong. It was using
`from_num_days_from_ce_opt` which
> Makes a new NaiveDate from a day's number in the proleptic Gregorian
calendar, with January 1, 1 being day 1.

while date32 is commonly represented as days since UNIX epoch
(1970-01-01)



# Related Issue(s)
closes delta-io#1670

# Documentation
It doesn't seem like parquet actually has a spec for what a `date`
should be, but many other tools use the epoch logic: duckdb and polars
use epoch instead of Gregorian, and the Arrow spec states that date32
should be epoch-based.

For example, if I write using polars:
```py
import polars as pl

df = pl.DataFrame(
    {
        "a": [
            10561,
            9200,
            9201,
            9202,
            9203,
            9204,
            9205,
            9206,
            9207,
            9208,
            9199,
        ]
    }
)

df.select(pl.col("a").cast(pl.Date)).write_delta("./db/polars/")
```
the stats are correctly interpreted
```
{"add":{"path":"0-7b8f11ab-a259-4673-be06-9deedeec34ff-0.parquet","size":557,"partitionValues":{},"modificationTime":1695779554372,"dataChange":true,"stats":"{\"numRecords\": 11, \"minValues\": {\"a\": \"1995-03-10\"}, \"maxValues\": {\"a\": \"1998-12-01\"}, \"nullCount\": {\"a\": 0}}"}}
```

* chore: update changelog for the rust-v0.16.0 release

* Remove redundant changelog entry for 0.16

* update readme

* fix: update the delta-inspect CLI so it builds again with Cargo

This sort of withered on the vine a bit; this pull request allows it to be built
properly again

* update readme

* chore: bump the version of the Rust crate

* fix: unify environment variables referenced by Databricks docs

Long-term fix will be for Databricks to release a Rust SDK for Unity 😄

Fixes delta-io#1627

* feat: support CREATE OR REPLACE

* docs: get docs.rs configured correctly again (delta-io#1693)

# Description

The docs build was changed in delta-io#1658 to compile on docs.rs with all
features, but our crate cannot compile with all-features due to the TLS
features, which are mutually exclusive.

# Related Issue(s)

For example:

- closes delta-io#1692

This has been tested locally with the following command:

```
cargo doc --features azure,datafusion,gcs,glue,json,python,s3,unity-experimental
```

* fix!: ensure predicates are parsable (delta-io#1690)

# Description
Resolves two issues that impact Datafusion implemented operators

1. When a user has an expression with a built-in scalar function, we are
unable to parse the output predicate since the
`DummyContextProvider`'s methods are unimplemented. The provider now
uses the user-provided state or a default. More work is required in the
future to allow a user-provided Datafusion state to be used during the
conflict checker.

2. The string representation was not parsable by sqlparser since it was
not valid SQL. New code was written to transform an expression into a
parsable SQL string. The current implementation is not exhaustive;
however, common use cases are covered.

The delta_datafusion.rs file is getting large so I transformed it into a
module.

This implementation reuses some code from Datafusion. I've added
the Apache License at the top of the file. Let me know if anything else is
required to be compliant.


# Related Issue(s)
- closes delta-io#1625

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* fix typo in readme

* fix: address formatting errors

* fix: remove an unused import

* feat(python): expose delete operation (delta-io#1687)

# Description
Naively expose the delete operation, with the option to provide a
predicate.

I first tried to expose a richer API with the Python `FilterType` and
DNF expressions, but from what I understand delta-rs doesn't implement
generic filters, only `PartitionFilter`. The `DeleteBuilder` also
only accepts datafusion expressions. So instead of hacking my way around
or proposing a refactor, I went for the simpler approach of sending a
string predicate to the rust lib.

If this implementation is OK I will add tests.
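A sketch of the exposed operation with a hypothetical table and predicate:

```python
from deltalake import DeltaTable

dt = DeltaTable("./my_table")  # hypothetical table path

# Delete rows matching a SQL-style predicate string...
dt.delete("id > 100")

# ...or every row when no predicate is given.
dt.delete()
```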

# Related Issue(s)
- closes delta-io#1417

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* docs(python): document the delete operation

* Introduce some redundant type definitions to the mypy stub

* chore: fix new clippy lints introduced in Rust 1.73

* Update the sphinx ignore for building

=_=

* Enable prebuffer

* implement issue 1169

* fix format

* feat: add version number in `.history()` and display in reversed chronological order (delta-io#1710)

# Description
Adds the version number to each commit info.
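A sketch of reading the new field, with a hypothetical table:

```python
from deltalake import DeltaTable

dt = DeltaTable("./my_table")  # hypothetical table path

# Commit infos are returned newest first, each carrying its version.
for commit in dt.history():
    print(commit["version"], commit.get("operation"))
```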

# Related Issue(s)
<!---
For example:

- closes delta-io#106 
--->
- Closes delta-io#1561
- Closes delta-io#1680

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>

* feat(python): expose UPDATE operation (delta-io#1694)

# Description

- Exposes UPDATE operation to Python.
- Added two test cases, with predicate and without
- Took some learnings in simplifying the code (will apply it in MERGE PR
as well)
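A sketch of both forms, with hypothetical column names:

```python
from deltalake import DeltaTable

dt = DeltaTable("./my_table")  # hypothetical table path

# With a predicate: update only matching rows.
dt.update(updates={"price": "price * 1.1"}, predicate="category = 'book'")

# Without a predicate: update every row.
dt.update(updates={"archived": "false"})
```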


# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

Closes delta-io#1505

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* fix: merge operation with string predicates (delta-io#1705)

# Description
Fixes an issue when users use string predicates with the merge
operation.

Parsing a string predicate did not properly handle table references and
would always assume a bare table with a table name of the empty string.
Now the qualifier is `None`; however, a `DFSchema` with qualifiers can be
supplied where it makes sense.

Now users must provide source and target aliases whenever both sides
share a column name otherwise the operation will error out.

Minor refactoring of the expression parser was also done and allowed
using of case expressions.


# Related Issue(s)
- closes delta-io#1699

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* refactor!: remove a layer of lifetimes from PartitionFilter (delta-io#1725)

# Description
This commit removes a bunch of lifetime restrictions on the
`PartitionFilter` and `PartitionFilterValue` classes to make them easier
to use. While the original discussion in Slack and delta-io#1501 made mention of
using a reference type, there doesn't seem to be a need for it. A
particular instance of a `PartitionFilter` is created once and just
borrowed and read for the remainder of its life.

Functions, when necessary, continue to accept the non-container types
(i.e., `&str` and `&[&str]`), allowing their containerized counterparts
(i.e., `String` and `Vec<String>`) to continue working with them without
needing to borrow or clone the containers.

# Related Issue(s)
- resolves delta-io#1501 

# Documentation

* feat(python): expose MERGE operation (delta-io#1685)

# Description
This exposes MERGE commands to the Python API. The updates and
predicates are first kept in the Class TableMerger and only dispatched
to Rust after `TableMerge.execute()`.

This was my first thought on how to implement it since I have limited
experience with Rust and PyO3 (still learning 😄). Maybe a more elegant
solution is that every class method on TableMerger is dispatched to Rust
and then the Rust MergeBuilder gets serialized and sent back to Python
(back and forth). Let me know your thoughts on this. If this is better,
I could also do this in the next PR, so we at least can push this one
out sooner.
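A sketch of the resulting Python API, with a hypothetical table and source:

```python
import pyarrow as pa

from deltalake import DeltaTable

dt = DeltaTable("./my_table")  # hypothetical table path
source = pa.table({"id": [1, 2], "value": ["a", "b"]})

(
    dt.merge(
        source=source,
        predicate="t.id = s.id",
        source_alias="s",
        target_alias="t",
    )
    .when_matched_update(updates={"value": "s.value"})
    .when_not_matched_insert(updates={"id": "s.id", "value": "s.value"})
    .execute()
)
```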

A couple of issues at the moment that I need feedback on, where the first
one is blocking since I can't test it now:

~- Source_alias is not applying, somehow during a schema check the
prefix is missing, however when I printed the lines inside merge, it
showed the prefix correctly. So not sure where the issue is~
~- I had to make datafusion_utils public since I needed to get the
Expression Struct from it, is this the right way to do that? @Blajda~

Edit:
I will pull @Blajda's changes
delta-io#1705 once merged with develop:


# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->
closes  delta-io#1357

* chore: remove deprecated functions

* chore: bump the python package version (delta-io#1734)

# Description
The description of the main changes of your pull request

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->

* fix: reorder encode_partition_value() checks and add tests (delta-io#1733)

# Description
The `isinstance(val, datetime)` check was after `isinstance(val, date)`,
which meant the datetime branch was never reached. I added a test for each
encoding type.
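The ordering matters because `datetime` is a subclass of `date`, so the `date` branch also matches `datetime` values:

```python
from datetime import date, datetime

# A datetime instance satisfies both checks, so the more specific
# datetime check must run first.
print(isinstance(datetime(2023, 1, 1, 12, 0), date))  # True
print(isinstance(date(2023, 1, 1), datetime))         # False
```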

---------

Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>

* Relax `pyarrow` pin

* fix: remove `pandas` pin (delta-io#1746)

# Description

Removes the `pandas` pin.

# Related Issue(s)

Resolves delta-io#1745

* docs: get docs.rs configured correctly again (delta-io#1693)

# Description

The docs build was changed in delta-io#1658 to compile on docs.rs with all
features, but our crate cannot compile with all-features due to the TLS
features, which are mutually exclusive.

# Related Issue(s)

For example:

- closes delta-io#1692

This has been tested locally with the following command:

```
cargo doc --features azure,datafusion,gcs,glue,json,python,s3,unity-experimental
```

* Make this a patch release to fix docs.rs

* Remove the hdfs feature from the docsrs build

* refactor!: update operations to use delta scan (delta-io#1639)

# Description
Recently implemented operations did not use `DeltaScan`, which had some
gaps. These gaps would make it harder to switch towards logical plans,
which is required for merge.

Gaps:
- It was not possible to include file lineage in the result
- The subset of files to be scanned is known ahead of time. Users had to
reconstruct a parquet scan based on those files

The PR introduces a `DeltaScanBuilder` that allow users to specify which
files to use when constructing the scan, if the scan should be enhanced
to include additional metadata columns, and allows a projection to be
specified. It also retains previous functionality of pruning based on
the provided filter when files to scan are not provided.

`DeltaScanConfig` is also introduced, which allows users to
deterministically obtain the names of any added metadata columns, or to
specify the name if required.

The public interface for `find_files` has changed but functionality
remains the same.

A new table provider was introduced which accepts a `DeltaScanConfig`.
This is required for future merge enhancements so unmodified files can
be pruned prior to writes.

---------

Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>

* chore: update datafusion (delta-io#1741)

Updates arrow and datafusion dependencies to latest.

* docs: convert docs to use mkdocs (delta-io#1731)

# Description
Completed the outstanding tasks in delta-io#1708

Also changed the theme from readthedocs to mkdocs - both are built-in but
the latter looks sleeker

# Related Issue(s)
closes delta-io#1708

---------

Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>

* docs: dynamodb lock configuration (delta-io#1752)

# Description
I have added documentation in the API and also on the Python usage page
regarding this configuration. Please let me know if it is satisfactory,
and if not, I am more than happy to address any issues or make any
necessary adjustments.

# Related Issue(s)
- closes delta-io#1674

# Documentation

* feat: ignore binary columns for stats generation

* feat: honor appendOnly table config (delta-io#1747)

# Description
Throw an error if a transaction includes Remove action with data change
but the Delta Table is append-only.
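A sketch of the behavior, assuming a hypothetical table created with the config set:

```python
import pyarrow as pa

from deltalake import DeltaTable, write_deltalake

data = pa.table({"x": [1, 2, 3]})

# Create a table with the appendOnly table config enabled.
write_deltalake("./my_table", data, configuration={"delta.appendOnly": "true"})

# Any transaction that removes data with data change now errors out.
DeltaTable("./my_table").delete()  # expected to raise
```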

# Related Issue(s)
- closes delta-io#352

* chore: fix building/running tests without the datafusion feature

This looks like an oversight that our CI didn't test because we have the
datafusion feature typically enabled for our tests. The build error would only
show up when building tests without it.

* add write support explicitly for pyarrow dataset

* feat(python): expose FSCK (repair) operation  (delta-io#1730)

# Description
This PR exposes the FSCK operation as a `repair` method under the
`DeltaTable` class.

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->
- closes delta-io#1727

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* refactor: perform bulk deletes during metadata cleanup

In addition to doing bulk deletes, I removed what seems like (at least to me)
unnecessary code. At its core, files are considered up for deletion
when their last_modified time is older than the cutoff time AND the version
is less than the specified version (usually the latest version).

* Make an attempt at improving the utilization of delete_stream for cleaning up expired logs

This change builds on @cmackenzie1's work and feeds the list stream directly into
the delete_stream with a predicate function to identify paths for deletion

* start to add vacuum into transaction log

* add vacuum operations in transaction log

* attempt to calculate size

* add test

* chore: bump Python package version

* fix: ignore inf in stats

* doc(README): remove typo

* enhance docs to enable multi-lingual examples

* use official Python API for references

* chore: refactor into the deltalake meta crate and deltalake-core crates

This puts the groundwork in place for starting to partition into smaller crates
in a simpler and more manageable fashion.

See delta-io#1713

* Correct the working directory for the parquet2 tests

* feat: add deltalake sql crate (delta-io#1757)

# Description

This is a fairly early draft to create logical plans from SQL using the
datafusion abstractions. I adopted the patterns over there quite closely,
since the ultimate goal would be to ask the datafusion community whether
they would accept these changes within the datafusion sql crate ...

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>

* rollback resolve bucket region change

---------

Co-authored-by: Robert Pack <robstar.pack@gmail.com>
Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>
Co-authored-by: nohajc <nohajc@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Co-authored-by: Denny Lee <denny.g.lee@gmail.com>
Co-authored-by: QP Hou <dave2008713@gmail.com>
Co-authored-by: haruband <haruband@gmail.com>
Co-authored-by: Ben Magee <ben@bmagee.com>
Co-authored-by: Constantin S. Pan <kvapen@gmail.com>
Co-authored-by: Eero Lihavainen <eero.lihavainen@nitor.com>
Co-authored-by: Mohammed Muddassir <v-mmuddassir@microsoft.com>
Co-authored-by: Christopher Watford <christopher.watford@kcftech.com>
Co-authored-by: Simon Vandel Sillesen <simon.vandel@gmail.com>
Co-authored-by: Ion Koutsouris <ioncjk@gmail.com>
Co-authored-by: Matthew Powers <matthewkevinpowers@gmail.com>
Co-authored-by: Sébastien Diemer <diemersebastien@yahoo.fr>
Co-authored-by: Cory Grinstead <universalmind.candy@gmail.com>
Co-authored-by: Trinity Xia <trinityx@trinityacstudio.lan>
Co-authored-by: hnaoto <hnaoto@me.com>
Co-authored-by: universalmind303 <cory.grinstead@gmail.com>
Co-authored-by: David Blajda <db@davidblajda.com>
Co-authored-by: Josiah Parry <josiah.parry@gmail.com>
Co-authored-by: Guilhem de Viry <gdeviry@mytraffic.fr>
Co-authored-by: Nikolay Ulmasov <ulmasov@hotmail.com>
Co-authored-by: Cole Mackenzie <cole@cloudflare.com>
Co-authored-by: ldacey <lance.dacey@gmail.com>
Co-authored-by: Dave Hirschfeld <dave.hirschfeld@gmail.com>
Co-authored-by: David Blajda <blajda@hotmail.com>
Co-authored-by: Brayan Jules <brayanjuls@users.noreply.github.com>
Co-authored-by: emcake <3726783+emcake@users.noreply.github.com>
Co-authored-by: Junjun Dong <junjun.dong9@gmail.com>
Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Co-authored-by: Deep145757 <146447579+Deep145757@users.noreply.github.com>
wjones127 added a commit that referenced this issue Nov 4, 2023
…date()` (#1749)

# Description
A user can now add a new_values dictionary that contains python objects
as a value.


Some weird behaviors I noticed, probably related to datafusion:
updating a timestamp column has to be done by providing a unix timestamp
in microseconds. I personally find this very confusing; I was expecting
to be able to pass "2012-10-01", for example, in the updates.

Another weird behaviour is with list-of-string columns. I can pass
`{"list_of_string_col":"[1,2,3]"}` or
`{"list_of_string_col":"['1','2','3']"}` and both will work. I expected
the first one to raise an exception on invalid datatypes. Combined
datatypes (`"[1,2,'3']"`) luckily do raise an error from datafusion.



# Related Issue(s)
<!---
For example:

- closes #106
--->
- closes #1740

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127 added a commit that referenced this issue Nov 4, 2023
# Description
You can now also do multiple when clauses just like in Rust and PySpark.
I added one test for now 😄, will add more later when I have some time.

I'll update the docs in another PR to reflect the possibility of this
behavior.
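A sketch of chaining several clauses, with a hypothetical table and source; clauses are evaluated in the order they are declared:

```python
import pyarrow as pa

from deltalake import DeltaTable

dt = DeltaTable("./my_table")  # hypothetical table path
source = pa.table({"id": [1, 2], "value": [5, -3]})

(
    dt.merge(source=source, predicate="t.id = s.id",
             source_alias="s", target_alias="t")
    .when_matched_update(updates={"value": "s.value"}, predicate="s.value > 0")
    .when_matched_update(updates={"value": "0"}, predicate="s.value <= 0")
    .execute()
)
```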

# Related Issue(s)
<!---
For example:

- closes #106
--->
- closes #1736

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127 added a commit that referenced this issue Nov 5, 2023
# Description

I built on top of @wjones127's branch from
#1602. In pyarrow v13+ the
ParquetWriter by default uses `compliant_nested_types = True` (see the
related PR: https://github.com/apache/arrow/pull/35146/files and the
docs:
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html).

In arrow/parquet-rs this fails when schemas are compared, because it
expects the old non-compliant ones. For now we can support pyarrow 13+
by disabling it or updating the file options provided by the user.

# Related Issue(s)
<!---
For example:

- closes #106
--->

- Closes #1744

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
wjones127 added a commit that referenced this issue Nov 6, 2023
# Description

Yet again, one of the linux release builds broke.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
roeap added a commit that referenced this issue Nov 6, 2023
# Description

~~this PR depends on #1741.~~

Migrating the implementation of actions and schema over from kernel. The
schema is much more complete in terms of the more recent delta features
and more rigorously leverages the rust type system.


# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
roeap pushed a commit that referenced this issue Nov 12, 2023
# Description
Add a convert_to_delta operation for converting a Parquet table to a
Delta Table in place.
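For illustration, the corresponding Python binding exposes this as `convert_to_deltalake`; a sketch, assuming a plain Parquet table at a hypothetical path:

```python
from deltalake import convert_to_deltalake

# Writes a _delta_log next to the existing Parquet data, in place.
convert_to_deltalake("./parquet_table")
```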

# Related Issue(s)
- closes #1041
- closes #1682
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco pushed a commit that referenced this issue Dec 2, 2023
# Description
Prepare for next release

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
Jan-Schweizer pushed a commit to Jan-Schweizer/delta-rs that referenced this issue Dec 2, 2023
# Description
Prepare for next release

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco added a commit that referenced this issue Dec 5, 2023
# Description

Latest Python release had a bunch of failures:
https://github.com/delta-io/delta-rs/actions/runs/7095801050

Also doing some general cleanup.

TODO:

 * [x] Figure out why Linux job failed

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
ion-elgreco added a commit that referenced this issue Jan 16, 2024
…s, to make sure tombstone and file paths match (#2035)

# Description
Percent-encoded file paths of Remove actions were not properly
deserialized, and when compared to active file paths, the paths didn't
match, which caused tombstones to be recognized as active files (i.e.,
kept in the state).

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Igor Borodin <igborodi@microsoft.com>
Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
natinimni pushed a commit to natinimni/delta-rs that referenced this issue Jan 31, 2024
…s, to make sure tombstone and file paths match (delta-io#2035)

Percent-encoded file paths of Remove actions were not properly
deserialized, and when compared to active file paths, the paths didn't
match, which caused tombstones to be recognized as active files (i.e.,
kept in the state).

<!---
For example:

- closes delta-io#106
--->

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Igor Borodin <igborodi@microsoft.com>
Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this issue Feb 2, 2024
…s, to make sure tombstone and file paths match (delta-io#2035)

# Description
Percent-encoded file paths of Remove actions were not properly
deserialized, and when compared to active file paths, the paths didn't
match, which caused tombstones to be recognized as active files (i.e.,
kept in the state).

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Igor Borodin <igborodi@microsoft.com>
Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
ion-elgreco pushed a commit that referenced this issue Mar 4, 2024
# Description
As requested by @ion-elgreco in #2229, we should fix the formatter
versions.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco added a commit that referenced this issue Mar 21, 2024
# Description
The description of the main changes of your pull request

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco pushed a commit that referenced this issue Apr 1, 2024
# Description
The description of the main changes of your pull request

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco added a commit that referenced this issue Apr 22, 2024
# Description
The description of the main changes of your pull request

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco pushed a commit that referenced this issue Apr 23, 2024
)

# Description
The AWS SDK uses EC2 instance metadata in the default provider chain,
the profile chain and the region provider

# Related Issue(s)
<!---
For example:

- closes #106
--->
- closes #2377 
# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco pushed a commit that referenced this issue Jun 11, 2024
# Description
Updates the arrow and datafusion dependencies to 52 and 39(-rc1)
respectively. This is necessary for updating pyo3.

While most changes were trivial, some required big rewrites. Namely, the
logic for the Update operation had to be rewritten (and simplified) to
accommodate some new sanity checks inside datafusion:
(apache/datafusion#10088).

Depends on delta-kernel having its arrow and object-store version bumped
as well. This PR doesn't include any major changes for pyo3, I'll open a
separate PR depending on this PR.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
ion-elgreco pushed a commit that referenced this issue Jun 14, 2024
# Description
This migrates the Python package to use the new pyo3 bounds-based API,
which allows more control over memory management on the library side and
theoretical performance improvements (I benchmarked, and didn't notice
anything substantial). The old API will be removed in 0.22.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco pushed a commit that referenced this issue Jun 19, 2024
# Description

Object stores expect fixed lengths for all multipart upload parts
right up until the last part. The original logic just flushed whenever the
buffer exceeded the threshold. Now it flushes only when the threshold is
met, always in parts of the same fixed size, unless we're completing the
transaction, in which case the last part is allowed to be smaller.

Bumps the constant to reflect that the minimum part size expected by most
object stores is 5MiB. Also adds a UserWarning if a smaller value is
specified.

Also releases the GIL in more places by moving the flushing logic to a
free function.
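A sketch of the fixed-size flushing rule in plain Python, assuming a 5 MiB minimum part size; `upload_part` is a hypothetical helper standing in for the object store call:

```python
PART_SIZE = 5 * 1024 * 1024  # minimum part size accepted by most stores


class PartBuffer:
    """Accumulates bytes and uploads only complete, fixed-size parts."""

    def __init__(self) -> None:
        self.buf = bytearray()

    def write(self, data: bytes) -> None:
        self.buf.extend(data)
        # Flush whole parts only; keep any remainder buffered.
        while len(self.buf) >= PART_SIZE:
            upload_part(bytes(self.buf[:PART_SIZE]))  # hypothetical helper
            del self.buf[:PART_SIZE]

    def close(self) -> None:
        # Only the final part may be smaller than PART_SIZE.
        if self.buf:
            upload_part(bytes(self.buf))
```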

# Related Issue(s)
<!---
For example:

- closes #106
--->

Closes #2605 

# Documentation

<!---
Share links to useful documentation
--->

See:
[MultipartUpload](https://docs.rs/object_store/latest/object_store/trait.MultipartUpload.html)
docs
ion-elgreco pushed a commit that referenced this issue Jun 21, 2024
# Description
Add support for HDFS using
[hdfs-native](https://github.com/Kimahriman/hdfs-native), a pure* Rust
client for interacting with HDFS. Creates a new `hdfs` sub-crate, adds
it as a feature to the `deltalake` meta crate, and includes it in Python
wheels by default. There is a Rust integration test that requires Hadoop
and Java to be installed, and makes use of a small Maven program I ship
under the `integration-test` feature flag to run a MiniDFS server.

*Dynamically loads `libgssapi_krb5` using `libloading` for Kerberos
support
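A sketch of opening a table over HDFS, with a hypothetical namenode address:

```python
from deltalake import DeltaTable

# hdfs:// URLs resolve through the hdfs-native backed store.
dt = DeltaTable("hdfs://namenode:9000/warehouse/my_table")
print(dt.version())
```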

# Related Issue(s)
<!---
For example:

- closes #106
--->
Resolves #2611 

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco added a commit that referenced this issue Jun 24, 2024
# Description
The description of the main changes of your pull request

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
rtyler pushed a commit that referenced this issue Jul 18, 2024
…ipeline (#2679)

Currently, `ruff` and `mypy` have their latest versions installed in the
CI pipeline, while locally they are fixed to a specific version. This
can cause issues, see #2678.

This PR proposes to fix them to their specific version in the pipeline.
The alternative I could think of was installing the virtual environment
with `make develop`, but that takes between 4 and 5 minutes, which might
be considered a bit too long to wait on linting results.

This PR will have conflicts with
#2674, so I'll need to rebase
one of these PR's once the other is merged.

# Related Issue(s)

- closes #2678
ion-elgreco pushed a commit that referenced this issue Jul 20, 2024
…peline (#2687)

# Description

The CI/CD pipeline currently contains some duplication; this PR proposes
to simplify that a bit by creating a reusable action to set up Python
and Rust.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco pushed a commit that referenced this issue Jul 20, 2024
# Description

This PR proposes to add a `make test-cov` command to the `Makefile` to
make it easier for contributors to check test coverage.

Output currently looks as follows:

```
[...]
tests/test_writerproperties.py::test_write_with_writerproperties PASSED    

---------- coverage: platform darwin, python 3.11.2-final-0 ----------
Name                        Stmts   Miss Branch BrPart  Cover
-------------------------------------------------------------
deltalake/__init__.py          11      0      0      0   100%
deltalake/_util.py             16      1     12      1    93%
deltalake/data_catalog.py       6      0      0      0   100%
deltalake/exceptions.py         5      0      0      0   100%
deltalake/fs.py                44     11      2      0    76%
deltalake/schema.py            60      0     22      0   100%
deltalake/table.py            431     42    174     23    89%
deltalake/writer.py           267     98    175     14    58%
-------------------------------------------------------------
TOTAL                         840    152    385     38    78%
Coverage HTML written to dir htmlcov
```

Also removed `--cov=deltalake` from the `pytest` ini commands, because I
think that is better moved to `[tool.coverage.run]`.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco pushed a commit that referenced this issue Jul 21, 2024
# Description
Part of #2686: fix writing an empty arrow dataset with the pyarrow engine.

# Related Issue(s)
<!---
For example:

- closes #106
--->
part of #2686 

# Documentation

<!---
Share links to useful documentation
--->
ion-elgreco pushed a commit that referenced this issue Jul 21, 2024
# Description
The codebase contains the following code twice:

```
if sys.version_info >= (3, 8):
    from typing import Literal
else:
    from typing_extensions import Literal
```

I believe this can be removed, since
[pyproject.toml](https://github.com/delta-io/delta-rs/blob/f432c4f8337c2b0d47958645684e5df336c61522/python/pyproject.toml#L10)
specifies that the minimum Python version for the project is 3.8:

```toml
requires-python = ">=3.8"
```

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->