Support merge (upsert) in Python #1357

Closed
hongbo-miao opened this issue May 11, 2023 · 12 comments · Fixed by #1685
Labels
enhancement New feature or request

Comments

@hongbo-miao

hongbo-miao commented May 11, 2023

Description

Use Case

I have a dataframe with a few columns: timestamp, current, voltage, temperature.
I want timestamp to be unique, like a primary key.

Currently, the code below keeps appending rows if I run it again on the same dataframe.

write_deltalake("s3a://my-bucket/my-delta-tables/motor", df, mode="append")

It would be great to support merge (upsert), as described at https://docs.databricks.com/delta/merge.html
Thanks! 😃
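To make the request concrete, here is a minimal plain-Python sketch of the difference between appending and upserting on a `timestamp` key. It does not use the deltalake API; the `append` and `upsert` helpers and the sample rows are hypothetical and exist only for illustration.

```python
from typing import Dict, List

Row = Dict[str, float]


def append(table: List[Row], batch: List[Row]) -> List[Row]:
    """Append semantics: every incoming row is added, duplicates included."""
    return table + batch


def upsert(table: List[Row], batch: List[Row], key: str = "timestamp") -> List[Row]:
    """Upsert semantics: update rows whose key matches, insert the rest."""
    merged = {row[key]: row for row in table}
    for row in batch:
        merged[row[key]] = row  # overwrite on match, insert otherwise
    return [merged[k] for k in sorted(merged)]


# Hypothetical sample data with the columns from the use case.
batch = [
    {"timestamp": 1.0, "current": 10.0, "voltage": 20.0, "temperature": 30.0},
    {"timestamp": 2.0, "current": 11.0, "voltage": 21.0, "temperature": 31.0},
]

# Writing the same batch twice duplicates rows with append but not with upsert.
appended_twice = append(append([], batch), batch)
upserted_twice = upsert(upsert([], batch), batch)
```

With append, the second run doubles the row count; with upsert, the table still holds one row per timestamp.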

Related Issue(s)

Originally asked at #1355

@mohitbansal-gep

Hi team, any updates on when this can be made available in Python?

@wjones127
Collaborator

This is blocked on implementing this in Rust, tracked in #850.

> Hi team, any updates on when this can be made available in Python?

I have no timeline yet. No one has started working on this.

@ion-elgreco
Collaborator

@wjones127 can you assign this to me? I would like to give this a try over the next few weekends :)

rtyler added a commit to rtyler/delta-rs that referenced this issue Sep 20, 2023
It would appear that in some cases Delta Live Tables will create a Delta table
which does not adhere to the Delta Table protocol.

The metaData action has a **required** `schemaString` property which simply
doesn't exist. Since it appears that this only exists at version zero of the
transaction log, and the _actual_ schema exists in the following versions of the
table (e.g. 1), this change introduces a default deserializer on the MetaData
action which provides a simple empty schema.
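The actual change is a default deserializer in the Rust code. As a rough Python analogue of the same idea, the sketch below falls back to an empty struct schema when `schemaString` is absent. The `parse_metadata` helper and the sample log entry are hypothetical.

```python
import json

# Empty struct schema in Delta's JSON schema format.
EMPTY_SCHEMA = json.dumps({"type": "struct", "fields": []})


def parse_metadata(action_json: str) -> dict:
    """Parse a metaData action, defaulting a missing schemaString."""
    meta = json.loads(action_json)["metaData"]
    # Only fills the gap; an existing schemaString is left untouched.
    meta.setdefault("schemaString", EMPTY_SCHEMA)
    return meta


# Hypothetical version-0 log entry that lacks schemaString.
broken = '{"metaData": {"id": "abc", "partitionColumns": []}}'
parsed = parse_metadata(broken)
schema = json.loads(parsed["schemaString"])
```

Readers that replay later log versions would then pick up the real schema from the next metaData action.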

This is an alternative implementation to delta-io#1305 which is a bit more invasive and
makes our schema_string struct member `Option<String>` which I do not believe is
worth it for this unfortunate compatibility issue

Closes delta-io#1305, delta-io#1302, delta-io#1357

Sponsored-by: Databricks Inc
@pfwnicks

Any updates on this? This would be an amazing feature to have out of the box for Python. Are any open issues still blocking it?

@ion-elgreco
Collaborator

> Any updates on this? This would be an amazing feature to have out of the box for Python. Are any open issues still blocking it?

I'm working on it, but I expect it will take a couple of weeks. I only have time to work on it on weekends.

wjones127 pushed a commit that referenced this issue Oct 17, 2023
# Description
This exposes MERGE commands to the Python API. The updates and
predicates are first kept in the class `TableMerger` and only dispatched
to Rust after `TableMerger.execute()`.
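A minimal sketch of that deferred-dispatch pattern, in plain Python: clauses are collected on the Python side and handed to a backend only when `execute()` is called. `LazyMerger` and its `backend` callable are hypothetical stand-ins, not the real `TableMerger` or the Rust binding.

```python
from typing import Any, Callable, Dict, List, Optional


class LazyMerger:
    """Hypothetical builder that defers all work until execute()."""

    def __init__(self, predicate: str, backend: Callable[[Dict[str, Any]], Any]):
        self._predicate = predicate
        self._backend = backend  # stands in for the call into Rust
        self._clauses: List[Dict[str, Any]] = []

    def when_matched_update(
        self, updates: Dict[str, str], predicate: Optional[str] = None
    ) -> "LazyMerger":
        self._clauses.append(
            {"kind": "matched_update", "updates": updates, "predicate": predicate}
        )
        return self

    def when_not_matched_insert(
        self, values: Dict[str, str], predicate: Optional[str] = None
    ) -> "LazyMerger":
        self._clauses.append(
            {"kind": "not_matched_insert", "values": values, "predicate": predicate}
        )
        return self

    def execute(self) -> Any:
        # Single dispatch: everything collected so far crosses the boundary at once.
        return self._backend({"predicate": self._predicate, "clauses": self._clauses})


calls: List[Dict[str, Any]] = []  # records what the fake backend receives
merger = (
    LazyMerger("target.id = source.id", backend=calls.append)
    .when_matched_update({"value": "source.value"})
    .when_not_matched_insert({"id": "source.id", "value": "source.value"})
)
calls_before_execute = len(calls)
merger.execute()
```

Nothing reaches the backend until `execute()`, which keeps the Python side simple at the cost of validating clauses late.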

This was my first thought on how to implement it since I have limited
experience with Rust and PyO3 (still learning 😄). Maybe a more elegant
solution is that every class method on TableMerger is dispatched to Rust
and then the Rust MergeBuilder gets serialized and sent back to Python
(back and forth). Let me know your thoughts on this. If this is better,
I could also do this in the next PR, so we at least can push this one
out sooner.

There are a couple of issues at the moment that I need feedback on; the first one
is blocking, since I can't test it now:

~- Source_alias is not applying, somehow during a schema check the
prefix is missing, however when I printed the lines inside merge, it
showed the prefix correctly. So not sure where the issue is~
~- I had to make datafusion_utils public since I needed to get the
Expression Struct from it, is this the right way to do that? @Blajda~

Edit:
I will pull @Blajda's changes
#1705 once merged with develop:


# Related Issue(s)
<!---
For example:

- closes #106
--->
closes  #1357
@mohitbansal-gep

When will this be released to PyPI?

@ion-elgreco
Collaborator

> When will this be released to PyPI?

Sometime this week.

@pfwnicks

awesome job @ion-elgreco! Thanks for cranking this out!

@ahmad2080

@ion-elgreco Thanks for enabling this!
Does this feature also enable schema merge? Something similar to this: https://docs.databricks.com/en/delta/update-schema.html#merge-schema-evolution

@ion-elgreco
Collaborator

> @ion-elgreco Thanks for enabling this!
> Does this feature also enable schema merge? Something similar to this: https://docs.databricks.com/en/delta/update-schema.html#merge-schema-evolution

No, only the merge operation.

@mohitbansal-gep

Has the usage been added to the docs?

@ion-elgreco
Collaborator

@mohitbansal-gep Nope, not yet. The API reference has examples for each command, and you can find some clearer examples in the blog post I wrote: https://delta.io/blog/2023-10-22-delta-rs-python-v0.12.0/
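For readers landing here, a hedged sketch of what an upsert on `timestamp` might look like with the released merge API. It assumes deltalake 0.12.0 or later with `pyarrow` and `pandas` installed; the `upsert_by_timestamp` wrapper is hypothetical, and the method names should be checked against the API reference for your installed version.

```python
def upsert_by_timestamp(table_uri: str, df):
    """Upsert the rows of a pandas DataFrame into a Delta table keyed on `timestamp`.

    Hypothetical helper; `table_uri` and `df` are supplied by the caller.
    """
    # Imported lazily so this module can be loaded without the packages installed.
    import pyarrow as pa
    from deltalake import DeltaTable

    source = pa.Table.from_pandas(df)
    dt = DeltaTable(table_uri)
    return (
        dt.merge(
            source=source,
            # Aliases disambiguate the column that exists on both sides.
            predicate="target.timestamp = source.timestamp",
            source_alias="source",
            target_alias="target",
        )
        .when_matched_update_all()      # matching timestamp: overwrite the row
        .when_not_matched_insert_all()  # new timestamp: insert the row
        .execute()
    )
```

Calling this twice with the same dataframe should leave one row per timestamp instead of appending duplicates, which is the behavior the issue asked for.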

ryanaston pushed a commit to segmentio/delta-rs that referenced this issue Nov 1, 2023
* feat: extend unit catalog support

* chore: draft datafusion integration

* fix: allow passing catalog options from python

* chore: clippy

* feat: add more azure credentials

* fix: add defaults for return types

* fix: simpler defaults

* Update rust/src/data_catalog/unity/mod.rs

Co-authored-by: nohajc <nohajc@gmail.com>

* fix: imports

* fix: add some defaults

* test: add failing provider test

* feat: list catalogs

* merge main

* fix: remove artifact

* fix: errors after merge with main

* Start python api docs

* docs: update Readme (delta-io#1440)

# Description

With summit coming up I thought we might update our README, since
delta-rs has evolved quite a bit since the README was first written...

Just opening the Draft to get feedback on the general "patterns" i.e.
how the tables are formatted, how detailed we want to show the features
and mostly the looks of the header.

Also hoping our community experts may have some content they want to add
here 😆.

cc @dennyglee @MrPowers @wjones127 @rtyler @houqp @fvaleye

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>

* Pin chrono to 0.4.30

v0.4.31 was just released which introduces some spurious deprecation warnings

* docs: update Readme (delta-io#1633)

# Description
- Changed the icons as, at first glance, it looked like AWS was not
supported (in blue), while the green open icon looked like it was
completed
- Added one line linking to the Delta Lake docker
- Fixed some minor grammar issues

Including community experts @roeap @MrPowers @wjones127 @rtyler @houqp
@fvaleye to ensure these updates make sense. Thanks!

* chore: update datafusion to 31, arrow to 46 and object_store to 0.7 (delta-io#1634)

# Description

Update datafusion to 31

* chore: relax chrono pin to 0.4 (delta-io#1635)

# Description

relax chrono pin to improve downstream compatibility.

* make create_checkpoint_for public

* add documentation to create_checkpoint_for

* Implement parsing for the new `domainMetadata` actions in the commit log

The Delta Lake protocol which will be released in conjunction with "3.0.0"
(currently at RC1) introduces `domainMetadata` actions to the commit log to
enable system or user-provided metadata about the commits to be added to the
log. With DBR 13.3 in the Databricks ecosystem, tables are already being written
with this action via the "liquid clustering" feature.

This change enables the clean reading of these tables, but at present nothing
novel is done with this information.

[Read more here](https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering)

Fixes delta-io#1626

Sponsored-by: Databricks Inc

* fix: include in-progress row group when calculating in-memory buffer length (delta-io#1638)

# Description
`PartitionWriter.buffer_len()` is documented as returning: 

> the current byte length of the in memory buffer.

However, this doesn't currently include the length of the in-progress
row group. This means that until a row group is flushed, `buffer_len()`
returns `0`. Based on the documented description, its length should
probably include the bytes currently in-memory as part of an unflushed
row group.

`buffered_record_batch_count` _does_ include in-progress row groups, so
this change also means record count and buffered bytes are reported
consistently.

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->
- closes delta-io#1637

# Documentation

<!---
Share links to useful documentation
--->

[`buffer_len` on
`RecordBatchWriter`](https://docs.rs/deltalake/0.15.0/deltalake/writer/record_batch/struct.RecordBatchWriter.html#method.buffer_len)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* feat: allow multiple incremental commits in optimize

Currently "optimize" executes the whole plan in one commit, which might
fail. The larger the table, the more likely it is to fail and the more
expensive the failure is.

Add an option in OptimizeBuilder that allows specifying a commit
interval. If that is provided, the plan executor will periodically
commit the accumulated actions.
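The idea can be sketched in plain Python as follows. This is an illustration of periodic commits only, not the `OptimizeBuilder` implementation; `run_with_commit_interval`, its parameters, and the fake clock are hypothetical.

```python
import itertools
import time
from typing import Callable, Iterable, List


def run_with_commit_interval(
    actions: Iterable[int],
    commit: Callable[[List[int]], None],
    min_commit_interval: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Accumulate actions and commit them whenever the interval has elapsed."""
    pending: List[int] = []
    commits = 0
    last_commit = clock()
    for action in actions:
        pending.append(action)
        if clock() - last_commit >= min_commit_interval:
            commit(pending)
            commits += 1
            pending = []
            last_commit = clock()
    if pending:  # flush whatever is left at the end of the plan
        commit(pending)
        commits += 1
    return commits


# Deterministic fake clock: each call advances time by one unit.
ticks = itertools.count()
committed: List[List[int]] = []
n_commits = run_with_commit_interval(
    range(6), committed.append, min_commit_interval=2, clock=lambda: next(ticks)
)
```

If a later batch fails, the work recorded by earlier commits is already durable, which is the motivation given above.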

* fix: explicitly require chrono 0.4.31 or greater

The Python binding relies on `timestamp_nanos_opt()` which requires 0.4.31 or
greater from chrono since it did not previously exist.

As a [cargo dependency
refresher](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies-from-cratesio)
this version range is >=0.4.31, < 0.5.0 which is I believe what we need for
optimal downstream compatibility.

* Correct some merge related errors with redundant package names from the workspace

* Address some latent clippy failures after merging main

* Correct the incorrect documentation for `Backoff`

* fix: avoid excess listing of log files

* feat: pass known file sizes to filesystem in Python (delta-io#1630)

# Description
Currently the Filesystem implementation always makes a HEAD request when
opening a file, to determine the file size. The proposed change is to
read the file sizes from the delta log instead, and to pass them down to
the `open_input_file` call, eliminating the HEAD request.

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->

* Proposed updated CODEOWNERS to allow better review notifications

Based on current pull request feedback and maintenance trends I'm suggesting
these rules to get the right people on the reviews by default.

Closes delta-io#1553

* fix: add support for Microsoft OneLake

This change introduces tests and support for Microsoft OneLake. This specific
commit is a rebase of the work done by our pals at Microsoft.

Co-authored-by: Mohammed Muddassir <v-mmuddassir@microsoft.com>
Co-authored-by: Christopher Watford <christopher.watford@kcftech.com>

* Ignore failing integration tests which require a special environment to operate

The OneLake support should be considered unsupported and experimental until such
time when we can add integration testing to our CI process

* Compensate for invalid log files created by Delta Live Tables

It would appear that in some cases Delta Live Tables will create a Delta table
which does not adhere to the Delta Table protocol.

The metaData action has a **required** `schemaString` property which simply
doesn't exist. Since it appears that this only exists at version zero of the
transaction log, and the _actual_ schema exists in the following versions of the
table (e.g. 1), this change introduces a default deserializer on the MetaData
action which provides a simple empty schema.

This is an alternative implementation to delta-io#1305 which is a bit more invasive and
makes our schema_string struct member `Option<String>` which I do not believe is
worth it for this unfortunate compatibility issue

Closes delta-io#1305, delta-io#1302, delta-io#1357

Sponsored-by: Databricks Inc

* chore: fix the incorrect Slack link in our readme

not sure what the deal is with the go.delta.io service, no idea where that lives

Fixes delta-io#1636

* enable offset listing for s3

* Make docs.rs build docs with all features enabled

I was confused that I could not find the documentation integrating datafusion with delta-rs.

With this PR, everything should show up. Perhaps docs for a feature gated method should also mention which feature is required. Similar to what Tokio does. Perhaps it could be done in followup PRs.

* feat: expose min_commit_interval to `optimize.compact` and `optimize.z_order` (delta-io#1645)

# Description
Exposes min_commit_interval in the Python API to `optimize.compact` and
`optimize.z_order`. Added one test-case to verify the
min_commit_interval.

# Related Issue(s)
closes delta-io#1640

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* docs: add docstring to protocol method (delta-io#1660)

* fix: percent encoding of partition values and paths

* feat: handle path encoding in serde and encode partition values in file names

* fix: always unquote partition values extracted from path

* test: add tests for related issues

* fix: consistent serialization of partition values

* fix: roundtrip special characters

* chore: format

* fix: add feature requirement to load example

* test: add timestamp col to partitioned roundtrip tests

* test: add rust roundtrip test for special characters

* fix: encode characters illegal on windows

* docs: fix some typos (delta-io#1662)

# Description
Saw two typos and marking merge in rust as half-done with a comment on
its current limitation.

* feat: use url parsing from object store

* fix: ensure config for ms fabric

* chore: drive-by simplify test files

* fix: update aws http config key

* fix: feature gate azure update

* feat: more robust azure config handling

* fix: in memory store handling

* feat: use object-store's s3 store if copy-if-not-exists headers are specified (delta-io#1356)

* refactor: re-organize top level modules (delta-io#1434)

# Description

~This contains changes from delta-io#1432, will rebase once that's merged.~

This PR constitutes the bulk of re-organising our top level modules.
- move `DeltaTable*` structs into new `table` module
- move table configuration into `table` module
- move schema related modules into `schema` module
- rename `action` module to `protocol` - hoping to isolate everything
that can one day be the log kernel.

~It also removes the deprecated commit logic from `DeltaTable` and
updates call sites and tests accordingly.~

I am planning one more follow up, where I hope to make `transactions`
currently within `operations` a top level module. While the number of
touched files here is already massive, I want to do this in a follow up,
as it will also include some updates to the transactions itself, that
should be more carefully reviewed.

# Related Issue(s)

closes: delta-io#1136

# Documentation

<!---
Share links to useful documentation
--->

* chore: increment python library version (delta-io#1664)

* fix exception string in writer.py

The exception message is ambiguous as it interchanges the table and data schemas.

* Update docs

* add read me

* Add space

* feat: allow to set large dtypes for the schema check in `write_deltalake` (delta-io#1668)

# Description
Until now, the schema check always used non-large types. I didn't know
this could be changed, so in polars we added schema casting from large
to non-large types. That became a problem today when I wanted to write
200M records at once, because the array was too big to fit in the
normal string type.

```python
ArrowInvalid: Failed casting from large_string to string: input array too large
```

Adding this flag will allow libraries like polars to write directly with
their large dtypes in arrow. If this is merged, I can work on fix in
polars to remove the schema casting for these large types.

* fix: change partitioning schema from large to normal string for pyarrow<12 (delta-io#1671)

# Description
If pyarrow is below v12.0.0 it changes the partitioning schema fields
from large_string to string.

# Related Issue(s)
closes delta-io#1669 

# Documentation
apache/arrow#34546 (comment)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* chore: bump rust crate version

* fix: use epoch instead of ce for date stats (delta-io#1672)

# Description
date32 statistics logic was subjectively wrong. It was using
`from_num_days_from_ce_opt` which
> Makes a new NaiveDate from a day's number in the proleptic Gregorian
calendar, with January 1, 1 being day 1.

while date32 is commonly represented as days since UNIX epoch
(1970-01-01)
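The difference is easy to see in plain Python, where `date.fromordinal` uses the same proleptic Gregorian day count (January 1 of year 1 is day 1). This is an illustrative sketch, not the Rust fix; the helper names are hypothetical, and the day values are the minimum and maximum from the polars example in this description.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def date32_from_epoch(days: int) -> date:
    """Interpret a date32 value as days since the UNIX epoch."""
    return UNIX_EPOCH + timedelta(days=days)


def date_from_ce(days: int) -> date:
    """Interpret the same number as a day count from 0001-01-01 (day 1)."""
    return date.fromordinal(days)


min_days, max_days = 9199, 10561

correct_min = date32_from_epoch(min_days)
correct_max = date32_from_epoch(max_days)
wrong_min = date_from_ce(min_days)  # lands in the first century, not the 1990s
```

The epoch interpretation reproduces the `minValues` and `maxValues` shown in the stats, while the common-era interpretation is off by roughly 1,969 years.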



# Related Issue(s)
closes delta-io#1670

# Documentation
It doesn't seem like parquet actually has a spec for what a `date`
should be, but many other tools seem to use the epoch logic.

duckdb, and polars seem to use epoch instead of gregorian. 

Also arrow spec states that date32 should be epoch.

For example, if I write using polars
```py
import polars as pl

# %%
df = pl.DataFrame(
    {
        "a": [
            10561,
            9200,
            9201,
            9202,
            9203,
            9204,
            9205,
            9206,
            9207,
            9208,
            9199,
        ]
    }
)
# %%

df.select(pl.col("a").cast(pl.Date)).write_delta("./db/polars/")
```
the stats are correctly interpreted
```
{"add":{"path":"0-7b8f11ab-a259-4673-be06-9deedeec34ff-0.parquet","size":557,"partitionValues":{},"modificationTime":1695779554372,"dataChange":true,"stats":"{\"numRecords\": 11, \"minValues\": {\"a\": \"1995-03-10\"}, \"maxValues\": {\"a\": \"1998-12-01\"}, \"nullCount\": {\"a\": 0}}"}}
```

* chore: update changelog for the rust-v0.16.0 release

* Remove redundant changelog entry for 0.16

* update readme

* fix: update the delta-inspect CLI to be built again by Cargo

This sort of withered on the vine a bit, this pull request allows it to be built
properly again

* update readme

* chore: bump the version of the Rust crate

* fix: unify environment variables referenced by Databricks docs

Long-term fix will be for Databricks to release a Rust SDK for Unity 😄

Fixes delta-io#1627

* feat: support CREATE OR REPLACE

* docs: get docs.rs configured correctly again (delta-io#1693)

# Description

The docs build was changed in delta-io#1658 to compile on docs.rs with all
features, but our crate cannot compile with all-features due to the TLS
features, which are mutually exclusive.

# Related Issue(s)

For example:

- closes delta-io#1692

This has been tested locally with the following command:

```
cargo doc --features azure,datafusion,datafusion,gcs,glue,json,python,s3,unity-experimental
```

* fix!: ensure predicates are parsable (delta-io#1690)

# Description
Resolves two issues that impact Datafusion implemented operators

1. When a user has an expression with a built-in scalar function
we are unable to parse the output predicate since the
`DummyContextProvider`'s methods are unimplemented. The provider now
uses the user provided state or a default. More work is required in the
future to allow a user provided Datafusion state to be used during the
conflict checker.

2. The string representation was not parsable by sqlparser since it was
not valid SQL. New code was written to transform an expression into a
parsable sql string. Current implementation is not exhaustive however
common use cases are covered.

The delta_datafusion.rs file is getting large so I transformed it into a
module.

This implementation makes reuse of some code from Datafusion. I've added
the Apache License at the top of the file. Let me know if anything else is
required to be compliant.


# Related Issue(s)
- closes delta-io#1625

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* fix typo in readme

* fix: address formatting errors

* fix: remove an unused import

* feat(python): expose delete operation (delta-io#1687)

# Description
Naively expose the delete operation, with the option to provide a
predicate.

I first tried to expose a richer API with the Python `FilterType` and
DNF expressions, but from what I understand delta-rs doesn't implement
generic filters but only `PartitionFilter`. The `DeleteBuilder` also
only accepts datafusion expressions. So Instead of hacking my way around
or proposing a refactor I went for the simpler approach of sending a
string predicate to the rust lib.

If this implementation is OK I will add tests.

# Related Issue(s)
- closes delta-io#1417

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* docs(python): document the delete operation

* Introduce some redundant type definitions to the mypy stub

* chore: fix new clippy lints introduced in Rust 1.73

* Update the sphinx ignore for building

=_=

* Enable prebuffer

* implement issue 1169

* fix format

* feat: add version number in `.history()` and display in reversed chronological order (delta-io#1710)

# Description
Adds the version number to each commit info.

# Related Issue(s)
<!---
For example:

- closes delta-io#106 
--->
- Closes delta-io#1561
- Closes delta-io#1680

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>

* feat(python): expose UPDATE operation (delta-io#1694)

# Description

- Exposes UPDATE operation to Python.
- Added two test cases, with predicate and without
- Took some learnings in simplifying the code (will apply it in MERGE PR
as well)


# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

Closes delta-io#1505

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* fix: merge operation with string predicates (delta-io#1705)

# Description
Fixes an issue when users use string predicates with the merge
operation.

Parsing a string predicate did not properly handle table references and
would always assume a bare table with a table name of the empty string.
Now the qualifier is `None` however a `DFSchema` with qualifiers can be
supplied where it makes sense.

Now users must provide source and target aliases whenever both sides
share a column name otherwise the operation will error out.

Minor refactoring of the expression parser was also done and allowed
using of case expressions.


# Related Issue(s)
- closes delta-io#1699

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* refactor!: remove a layer of lifetimes from PartitionFilter (delta-io#1725)

# Description
This commit removes a bunch of lifetime restrictions on the
`PartitionFilter` and `PartitionFilterValue` classes to make them easier
to use. While the original discussion in Slack and delta-io#1501 made mention of
using a reference type, there doesn't seem to be a need for it. A
particular instance of a `PartitionFilter` is created once and just
borrowed and read for the remainder of its life.

Functions, when necessary continue to accept the non-container types
(i.e, `&str` and `&[&str]`), allowing their containerized counterparts
to continue working with them without needing to borrow or clone the
containers (i.e, `String` and `Vec<String>`).

# Related Issue(s)
- resolves delta-io#1501 

# Documentation

* feat(python): expose MERGE operation (delta-io#1685)

# Description
This exposes MERGE commands to the Python API. The updates and
predicates are first kept in the class `TableMerger` and only dispatched
to Rust after `TableMerger.execute()`.

This was my first thought on how to implement it since I have limited
experience with Rust and PyO3 (still learning 😄). Maybe a more elegant
solution is that every class method on TableMerger is dispatched to Rust
and then the Rust MergeBuilder gets serialized and sent back to Python
(back and forth). Let me know your thoughts on this. If this is better,
I could also do this in the next PR, so we at least can push this one
out sooner.

Couple of issues at the moment, I need feedback on, where the first one
is blocking since I can't test it now:

~- Source_alias is not applying, somehow during a schema check the
prefix is missing, however when I printed the lines inside merge, it
showed the prefix correctly. So not sure where the issue is~
~- I had to make datafusion_utils public since I needed to get the
Expression Struct from it, is this the right way to do that? @Blajda~

Edit:
I will pull @Blajda's changes
delta-io#1705 once merged with develop:


# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->
closes  delta-io#1357

* chore: remove deprecated functions

* chore: bump the python package version (delta-io#1734)

* fix: reorder encode_partition_value() checks and add tests (delta-io#1733)

# Description
The `isinstance(val, datetime)` check was after `isinstance(val, date)`
which meant that it was never found. I added a test for each encoding
type.
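The root cause is that `datetime` is a subclass of `date`, so the `date` check matches first. A minimal sketch of the bug and the reordered fix follows; the function names and format strings are assumptions for illustration, not necessarily what `encode_partition_value()` emits.

```python
from datetime import date, datetime


def encode_buggy(val) -> str:
    """Checks `date` first, so datetimes lose their time component."""
    if isinstance(val, date):  # also True for datetime instances
        return val.strftime("%Y-%m-%d")
    if isinstance(val, datetime):  # unreachable for datetimes
        return val.strftime("%Y-%m-%d %H:%M:%S")
    return str(val)


def encode_fixed(val) -> str:
    """Checks the more specific `datetime` type before `date`."""
    if isinstance(val, datetime):
        return val.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(val, date):
        return val.strftime("%Y-%m-%d")
    return str(val)


ts = datetime(2023, 10, 17, 12, 30, 0)
buggy = encode_buggy(ts)
fixed = encode_fixed(ts)
plain_date = encode_fixed(date(2023, 10, 17))
```

The general rule is to test subclasses before their base classes in any `isinstance` chain.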

---------

Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>

* Relax `pyarrow` pin

* fix: remove `pandas` pin (delta-io#1746)

# Description

Removes the `pandas` pin.

# Related Issue(s)

Resolves delta-io#1745

* Make this a patch release to fix docs.rs

* Remove the hdfs feature from the docsrs build

* refactor!: update operations to use delta scan (delta-io#1639)

# Description
Recently implemented operations did not use `DeltaScan` because it had some
gaps. These gaps would make it harder to switch towards logical plans, which
is required for merge.

Gaps:
- It was not possible to include file lineage in the result
- The subset of files to be scanned is known ahead of time. Users had to
reconstruct a parquet scan based on those files

The PR introduces a `DeltaScanBuilder` that allow users to specify which
files to use when constructing the scan, if the scan should be enhanced
to include additional metadata columns, and allows a projection to be
specified. It also retains previous functionality of pruning based on
the provided filter when files to scan are not provided.

`DeltaScanConfig` is also introduced which allows users to deterministically
obtain the names of any added metadata columns or allows them to specify
the name if required.

The public interface for `find_files` has changed but functionality
remains the same.

A new table provider was introduced which accepts a `DeltaScanConfig`.
This is required for future merge enhancements so unmodified files can
be pruned prior to writes.

---------

Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>

* chore: update datafusion (delta-io#1741)

Updates arrow and datafusion dependencies to latest.

* docs: convert docs to use mkdocs (delta-io#1731)

# Description
Completed the outstanding tasks in delta-io#1708

Also changed theme from readthedocs to mkdocs - both are built-in but
latter looks sleeker

# Related Issue(s)
closes delta-io#1708

---------

Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>

* docs: dynamodb lock configuration (delta-io#1752)

# Description
I have added documentation in the API and also on the Python usage page
regarding this configuration. Please let me know if it is satisfactory,
and if not, I am more than happy to address any issues or make any
necessary adjustments.

# Related Issue(s)
- closes delta-io#1674

# Documentation

* feat: ignore binary columns for stats generation

* feat: honor appendOnly table config (delta-io#1747)

# Description
Throw an error if a transaction includes Remove action with data change
but the Delta Table is append-only.
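A minimal sketch of such a check, written in Python over dictionary-shaped actions rather than the Rust types. `check_append_only` and `AppendOnlyViolation` are hypothetical names, and treating a remove without a `dataChange` flag as a data change is an assumption made here.

```python
from typing import Iterable, Mapping


class AppendOnlyViolation(Exception):
    """Raised when a commit would remove data from an append-only table."""


def check_append_only(
    actions: Iterable[Mapping], table_config: Mapping[str, str]
) -> None:
    """Reject data-changing remove actions when delta.appendOnly is true."""
    if table_config.get("delta.appendOnly", "false").lower() != "true":
        return
    for action in actions:
        remove = action.get("remove")
        # Assumption: a remove without an explicit dataChange flag counts as a change.
        if remove is not None and remove.get("dataChange", True):
            raise AppendOnlyViolation(
                f"cannot remove {remove.get('path')!r} from an append-only table"
            )


config = {"delta.appendOnly": "true"}
compaction = [
    {"remove": {"path": "a.parquet", "dataChange": False}},
    {"add": {"path": "b.parquet", "dataChange": False}},
]
delete = [{"remove": {"path": "a.parquet", "dataChange": True}}]

check_append_only(compaction, config)  # allowed: rewrites files, data unchanged

try:
    check_append_only(delete, config)
    rejected = False
except AppendOnlyViolation:
    rejected = True
```

Compaction-style rewrites still pass because their remove actions do not change the data.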

# Related Issue(s)
- closes delta-io#352

* chore: fix building/running tests without the datafusion feature

This looks like an oversight that our CI didn't test because we have the
datafusion feature typically enabled for our tests. The build error would only
show up when building tests without it.

* add write support explicitly for pyarrow dataset

* feat(python): expose FSCK (repair) operation  (delta-io#1730)

# Description
This PR exposes the FSCK operation as a `repair` method under the
`DeltaTable` class.

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->
- closes delta-io#1727

---------

Co-authored-by: Will Jones <willjones127@gmail.com>

* refactor: perform bulk deletes during metadata cleanup

In addition to doing bulk deletes, I removed what seems like (at least to me)
unnecessary code. At its core, files are considered up for deletion
when their last_modified time is older than the cutoff time AND the version
is less than the specific version (usually the latest version).
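That rule can be sketched as a simple predicate. The `LogFile` type, `expired_logs` helper, and sample values below are hypothetical, and using strict `<` for "older than" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogFile:
    version: int
    last_modified_ms: int  # milliseconds since the UNIX epoch


def expired_logs(
    files: List[LogFile], cutoff_ms: int, before_version: int
) -> List[LogFile]:
    """Select files that are both older than the cutoff and below the version bound."""
    return [
        f
        for f in files
        if f.last_modified_ms < cutoff_ms and f.version < before_version
    ]


files = [
    LogFile(version=0, last_modified_ms=1_000),
    LogFile(version=1, last_modified_ms=2_000),
    LogFile(version=2, last_modified_ms=9_000),  # too recent
    LogFile(version=3, last_modified_ms=1_500),  # old, but not below the bound
]
to_delete = expired_logs(files, cutoff_ms=5_000, before_version=3)
```

Both conditions must hold, so an old file at or above the version bound is kept.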

* Make an attempt at improving the utilization of delete_stream for cleaning up expired logs

This change builds on @cmackenzie1's work and feeds the list stream directly into
the delete_stream with a predicate function to identify paths for deletion

* start to add vacuum into transaction log

* add vacuum operations in transaction log

* attempt to calculate size

* add test

* chore: bump Python package version

* fix: ignore inf in stats

* doc(README): remove typo

* enhance docs to enable multi-lingual examples

* use official Python API for references

* chore: refactor into the deltalake meta crate and deltalake-core crates

This puts the groundwork in place for starting to partition into smaller crates
in a simpler and more manageable fashion.

See delta-io#1713

* Correct the working directory for the parquet2 tests

* feat: add deltalake sql crate (delta-io#1757)

# Description

This is a fairly early draft to create logical plans from sql using the
datafusion abstractions. Adopted the patterns over there quite closely
since the ultimate goal would be to ask the datafusion community if they
would accept these changes within the datafusion sql crate ...

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>

* rollback resolve bucket region change

---------

Co-authored-by: Robert Pack <robstar.pack@gmail.com>
Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com>
Co-authored-by: nohajc <nohajc@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Co-authored-by: Denny Lee <denny.g.lee@gmail.com>
Co-authored-by: QP Hou <dave2008713@gmail.com>
Co-authored-by: haruband <haruband@gmail.com>
Co-authored-by: Ben Magee <ben@bmagee.com>
Co-authored-by: Constantin S. Pan <kvapen@gmail.com>
Co-authored-by: Eero Lihavainen <eero.lihavainen@nitor.com>
Co-authored-by: Mohammed Muddassir <v-mmuddassir@microsoft.com>
Co-authored-by: Christopher Watford <christopher.watford@kcftech.com>
Co-authored-by: Simon Vandel Sillesen <simon.vandel@gmail.com>
Co-authored-by: Ion Koutsouris <ioncjk@gmail.com>
Co-authored-by: Matthew Powers <matthewkevinpowers@gmail.com>
Co-authored-by: Sébastien Diemer <diemersebastien@yahoo.fr>
Co-authored-by: Cory Grinstead <universalmind.candy@gmail.com>
Co-authored-by: Trinity Xia <trinityx@trinityacstudio.lan>
Co-authored-by: hnaoto <hnaoto@me.com>
Co-authored-by: universalmind303 <cory.grinstead@gmail.com>
Co-authored-by: David Blajda <db@davidblajda.com>
Co-authored-by: Josiah Parry <josiah.parry@gmail.com>
Co-authored-by: Guilhem de Viry <gdeviry@mytraffic.fr>
Co-authored-by: Nikolay Ulmasov <ulmasov@hotmail.com>
Co-authored-by: Cole Mackenzie <cole@cloudflare.com>
Co-authored-by: ldacey <lance.dacey@gmail.com>
Co-authored-by: Dave Hirschfeld <dave.hirschfeld@gmail.com>
Co-authored-by: David Blajda <blajda@hotmail.com>
Co-authored-by: Brayan Jules <brayanjuls@users.noreply.github.com>
Co-authored-by: emcake <3726783+emcake@users.noreply.github.com>
Co-authored-by: Junjun Dong <junjun.dong9@gmail.com>
Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Co-authored-by: Deep145757 <146447579+Deep145757@users.noreply.github.com>
6 participants