Implement delta log checkpointing #106
Labels: binding/rust (Issues for the Rust crate), enhancement (New feature or request), help wanted (Extra attention is needed)

houqp added the binding/rust, enhancement, and help wanted labels on Mar 3, 2021
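Since the activity below is mostly commit traffic, a quick sketch of what "delta log checkpointing" means may help: a checkpoint collapses the JSON commit files up to some version into a single snapshot of the live `add` actions, so readers can start from the snapshot instead of replaying every commit, and a `_last_checkpoint` file points readers at it. The sketch below is illustrative only, not the delta-rs implementation — it writes a JSON snapshot for brevity, whereas real Delta checkpoints are Parquet files:

```python
import json
from pathlib import Path

def create_checkpoint(log_dir: Path, version: int) -> Path:
    """Fold commits 0..version into one snapshot of the live 'add' actions."""
    live = {}  # file path -> its most recent 'add' action
    for v in range(version + 1):
        commit = log_dir / f"{v:020}.json"  # Delta commit names are zero-padded to 20 digits
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                live.pop(action["remove"]["path"], None)  # tombstoned file is no longer live
    checkpoint = log_dir / f"{version:020}.checkpoint.json"
    checkpoint.write_text("\n".join(json.dumps({"add": a}) for a in live.values()))
    # Readers consult _last_checkpoint to find the snapshot instead of listing the log.
    (log_dir / "_last_checkpoint").write_text(json.dumps({"version": version}))
    return checkpoint
```

A reader would then load the checkpoint and only replay commits newer than the recorded version.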
bbigras added a commit to bbigras/delta-rs that referenced this issue on Sep 30, 2021
houqp pushed a commit that referenced this issue on Sep 30, 2021
wjones127 added a commit that referenced this issue on Sep 30, 2022
# Description This is the published version of the google doc https://docs.google.com/document/d/1iKVdnilMCS6qziBBVohXw2-UX_Ld2K6AAjLTZHcYhQ0/edit#heading=h.iezbyetoxtf
fvaleye pushed a commit that referenced this issue on Nov 16, 2022
# Description Update datafusion and arrow dependencies to the latest versions.
wjones127 pushed a commit that referenced this issue on Nov 17, 2022
# Description When passing `parquet_read_options` to `to_pyarrow_dataset`, it is now possible to use `dictionary_columns` to control which columns should be dictionary-encoded as they are read. # Related Issue(s) - closes #938
roeap added a commit that referenced this issue on Nov 17, 2022
# Description This PR builds on top of the changes to handling the runtime in #933. In my local tests this fixed #915. Additionally, I added the runtime as a property on the fs handler to avoid re-creating it on every call. In some non-representative tests with a large number of very small partitions, it cut the runtime roughly in half. cc @wjones127
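The "runtime as a property" idea above — build an expensive resource once and reuse it on every call instead of re-creating it — can be sketched in Python with `functools.cached_property` (the `Handler` class here is purely illustrative, not the actual delta-rs fs handler):

```python
from functools import cached_property

class Handler:
    """Illustrative filesystem handler that lazily builds one runtime."""

    def __init__(self):
        self.created = 0  # counts how many runtimes were built

    @cached_property
    def runtime(self):
        # Built on first access only; every later access reuses the same object.
        self.created += 1
        return object()

h = Handler()
assert h.runtime is h.runtime  # same instance on every access
assert h.created == 1          # constructed exactly once
```

The cached attribute lives on the instance, so each handler still gets its own runtime.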
roeap added a commit that referenced this issue on Nov 22, 2022
# Description Adds a simple example of how to use the operations APIs to create and work with delta tables, plus some minor documentation tweaks. Co-authored-by: Will Jones <willjones127@gmail.com>
roeap added a commit that referenced this issue on Dec 3, 2022
# Description This PR integrates the `DeltaDataChecker` into the write path of the operations API. I was unsure whether to integrate this in the (experimental) writer in the operations module, but opted to keep the writer itself focused on the lower-level write operations.
wjones127 pushed a commit that referenced this issue on Dec 9, 2022
# Description This loosens the version requirement for maturin when building the python package. # Related Issue(s) - closes #1004 # Documentation Using an exact version can be excessively restrictive, for example preventing the use of the latest (but still compatible) version of maturin. Especially when packaging for rolling-release distributions (like Arch Linux), being able to build with the latest version is beneficial.
wjones127 pushed a commit that referenced this issue on Dec 19, 2022
wjones127 pushed a commit that referenced this issue on Dec 19, 2022
# Description Use [labeler](https://github.com/actions/labeler) to automatically label PRs. There are language labels, CI and documentation labels, as well as crate-specific labels. # Related Issue(s) Closes #1026 & unblocks #997.
wjones127 added a commit that referenced this issue on Dec 21, 2022
# Description Fix lint issues from the latest cargo release.
wjones127 added a commit that referenced this issue on Dec 21, 2022
# Description This is intended to fix the issue where the Python CI job is frequently failing. I believe it's because the Docker services haven't fully started up before the tests start running. To address this, I added a function to wait for the services to be responsive.
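The "wait until the services are responsive" approach described above can be sketched as a plain TCP poll (stdlib only; the helper name and timeouts are illustrative, not the actual CI code):

```python
import socket
import time

def wait_for_service(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll a TCP port until it accepts connections or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True  # service is accepting connections
        except OSError:
            time.sleep(0.5)  # not up yet; retry shortly
    return False
```

A CI harness would call this once per Docker service (e.g. the object-store emulators) before kicking off the test suite.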
fvaleye pushed a commit that referenced this issue on Dec 27, 2022
# Description This PR consolidates the four methods `files()`, `file_paths()`, `file_uris()`, and `files_by_partitions()` into just two methods: * `files()` -> returns paths as they are in the Delta Log (usually relative, but *can* be absolute, particularly if they are located outside of the delta table root). * `file_uris()` -> returns absolute URIs for all files. Both of these now take the `partition_filters` parameter, making `files_by_partitions()` obsolete. The latter function has been marked deprecated, but has also been restored to its original behavior of returning absolute file paths rather than relative ones, resolving #894. Finally, the `partition_filters` parameter now supports passing values other than strings, such as integers and floats. TODO: * [x] Update documentation * [ ] ~~Test behavior of filtering for null or non-null~~ Null handling isn't supported by DNF filters IIUC * [x] Test behavior of paths on object stores.
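The relative-vs-absolute distinction above can be illustrated with a small stdlib helper (the function name and table root are hypothetical, not the deltalake API): a log path that already carries a scheme is kept as-is, anything else is resolved against the table root.

```python
from urllib.parse import urlparse

def to_file_uri(log_path: str, table_root: str) -> str:
    """Resolve a path from the Delta log to an absolute URI.

    Paths in the log are usually relative to the table root, but may
    already be absolute URIs (e.g. files living outside the table root).
    """
    if urlparse(log_path).scheme:
        return log_path  # already absolute, keep as-is
    return f"{table_root.rstrip('/')}/{log_path}"

print(to_file_uri("part-00001.snappy.parquet", "s3://bucket/table"))
# s3://bucket/table/part-00001.snappy.parquet
```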
roeap added a commit that referenced this issue on Jan 5, 2023
# Description Moving the `vacuum` operation into the operations module and adopting `IntoFuture` for the command builder. This breaks the APIs for the builder (now with consistent setter names), but we are able to keep the APIs for `DeltaTable` in rust and python. In a follow-up I would like to move the optimize command as well. This, however, may require refactoring `PartitionValue`, since we can only deal with `'static` lifetimes when using `IntoFuture`. A while back we talked about pulling in `ScalarValue` from datafusion to optimize that implementation, and maybe that's a good opportunity to look into that as well. Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127 added a commit that referenced this issue on Jan 17, 2023
# Description Considering adding continuous benchmarks to the Python reader / writer.
wjones127 added a commit that referenced this issue on Jan 26, 2023
# Description In preparation for a new release.
roeap added a commit that referenced this issue on Jan 27, 2023
# Description The latest rust release comes with a new, more opinionated clippy :). This PR fixes the new clippy errors and runs `cargo clippy --fix` on all our crates.
wjones127 pushed a commit that referenced this issue on Jan 28, 2023
# Description Adds a test for the `left_larger_than_right` function and rewrites the function's match expression to match on both the `left` and `right` arguments. --------- Signed-off-by: Marijn Valk <marijncv@hotmail.com>
roeap added a commit that referenced this issue on Feb 1, 2023
# Description A simple maintenance PR to update datafusion to the latest version.
wjones127 added a commit that referenced this issue on Oct 30, 2023
roeap added a commit that referenced this issue on Oct 31, 2023
# Description This is a fairly early draft to create logical plans from sql using the datafusion abstractions. It adopts the patterns over there quite closely, since the ultimate goal would be to ask the datafusion community whether they would accept these changes within the datafusion sql crate ... --------- Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
ryanaston pushed a commit to segmentio/delta-rs that referenced this issue on Nov 1, 2023
* feat: extend unity catalog support * chore: draft datafusion integration * fix: allow passing catalog options from python * chore: clippy * feat: add more azure credentials * fix: add defaults for return types * fix: simpler defaults * Update rust/src/data_catalog/unity/mod.rs Co-authored-by: nohajc <nohajc@gmail.com> * fix: imports * fix: add some defaults * test: add failing provider test * feat: list catalogs * merge main * fix: remove artifact * fix: errors after merge with main * Start python api docs * docs: update Readme (delta-io#1440) # Description With summit coming up I thought we might update our README, since delta-rs has evolved quite a bit since the README was first written... Just opening the Draft to get feedback on the general "patterns" i.e. how the tables are formatted, how detailed we want to show the features and mostly the looks of the header. Also hoping our community experts may have some content they want to add here 😆. cc @dennyglee @MrPowers @wjones127 @rtyler @houqp @fvaleye --------- Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: R. Tyler Croy <rtyler@brokenco.de> * Pin chrono to 0.4.30 v0.4.31 was just released which introduces some spurious deprecation warnings * docs: update Readme (delta-io#1633) # Description - Changed the icons as, at first glance, it looked like AWS was not supported (in blue), while the green open icon looked like it was completed - Added one line linking to the Delta Lake docker - Fixed some minor grammar issues Including community experts @roeap @MrPowers @wjones127 @rtyler @houqp @fvaleye to ensure these updates make sense. Thanks! * chore: update datafusion to 31, arrow to 46 and object_store to 0.7 (delta-io#1634) # Description Update datafusion to 31 * chore: relax chrono pin to 0.4 (delta-io#1635) # Description relax chrono pin to improve downstream compatibility.
* make create_checkpoint_for public * add documentation to create_checkpoint_for * Implement parsing for the new `domainMetadata` actions in the commit log The Delta Lake protocol which will be released in conjunction with "3.0.0" (currently at RC1) introduces `domainMetadata` actions to the commit log to enable system or user-provided metadata about the commits to be added to the log. With DBR 13.3 in the Databricks ecosystem, tables are already being written with this action via the "liquid clustering" feature. This change enables the clean reading of these tables, but at present nothing novel is done with this information. [Read more here](https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering) Fixes delta-io#1626 Sponsored-by: Databricks Inc * fix: include in-progress row group when calculating in-memory buffer length (delta-io#1638) # Description `PartitionWriter.buffer_len()` is documented as returning: > the current byte length of the in memory buffer. However, this doesn't currently include the length of the in-progress row group. This means that until a row group is flushed, `buffer_len()` returns `0`. Based on the documented description, its length should probably include the bytes currently in-memory as part of an unflushed row group. `buffered_record_batch_count` _does_ include in-progress row groups, so this change also means record count and buffered bytes are reported consistently. # Related Issue(s) <!--- For example: - closes delta-io#106 ---> - closes delta-io#1637 # Documentation <!--- Share links to useful documentation ---> [`buffer_len` on `RecordBatchWriter`](https://docs.rs/deltalake/0.15.0/deltalake/writer/record_batch/struct.RecordBatchWriter.html#method.buffer_len) --------- Co-authored-by: Will Jones <willjones127@gmail.com> * feat: allow multiple incremental commits in optimize Currently "optimize" executes the whole plan in one commit, which might fail. 
The larger the table, the more likely it is to fail and the more expensive the failure is. Add an option in OptimizeBuilder that allows specifying a commit interval. If that is provided, the plan executor will periodically commit the accumulated actions. * fix: explicitly require chrono 0.4.31 or greater The Python binding relies on `timestamp_nanos_opt()`, which requires 0.4.31 or greater from chrono since it did not previously exist. As a [cargo dependency refresher](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies-from-cratesio), this version range is >=0.4.31, < 0.5.0, which I believe is what we need for optimal downstream compatibility. * Correct some merge related errors with redundant package names from the workspace * Address some latent clippy failures after merging main * Correct the incorrect documentation for `Backoff` * fix: avoid excess listing of log files * feat: pass known file sizes to filesystem in Python (delta-io#1630) # Description Currently the Filesystem implementation always makes a HEAD request when opening a file, to determine the file size. The proposed change is to read the file sizes from the delta log instead, and to pass them down to the `open_input_file` call, eliminating the HEAD request. * Proposed updated CODEOWNERS to allow better review notifications Based on current pull request feedback and maintenance trends I'm suggesting these rules to get the right people on the reviews by default. Closes delta-io#1553 * fix: add support for Microsoft OneLake This change introduces tests and support for Microsoft OneLake. This specific commit is a rebase of the work done by our pals at Microsoft.
Co-authored-by: Mohammed Muddassir <v-mmuddassir@microsoft.com> Co-authored-by: Christopher Watford <christopher.watford@kcftech.com> * Ignore failing integration tests which require a special environment to operate The OneLake support should be considered unsupported and experimental until such time as we can add integration testing to our CI process * Compensate for invalid log files created by Delta Live Tables It would appear that in some cases Delta Live Tables will create a Delta table which does not adhere to the Delta Table protocol. The metaData action has a **required** `schemaString` property which simply doesn't exist. Since it appears that this only exists at version zero of the transaction log, and the _actual_ schema exists in the following versions of the table (e.g. 1), this change introduces a default deserializer on the MetaData action which provides a simple empty schema. This is an alternative implementation to delta-io#1305, which is a bit more invasive and makes our schema_string struct member `Option<String>`, which I do not believe is worth it for this unfortunate compatibility issue Closes delta-io#1305, delta-io#1302, delta-io#1357 Sponsored-by: Databricks Inc * chore: fix the incorrect Slack link in our readme not sure what the deal is with the go.delta.io service, no idea where that lives Fixes delta-io#1636 * enable offset listing for s3 * Make docs.rs build docs with all features enabled I was confused that I could not find the documentation integrating datafusion with delta-rs. With this PR, everything should show up. Perhaps docs for a feature gated method should also mention which feature is required, similar to what Tokio does. Perhaps it could be done in followup PRs. * feat: expose min_commit_interval to `optimize.compact` and `optimize.z_order` (delta-io#1645) # Description Exposes min_commit_interval in the Python API to `optimize.compact` and `optimize.z_order`. Added one test-case to verify the min_commit_interval.
# Related Issue(s) closes delta-io#1640 --------- Co-authored-by: Will Jones <willjones127@gmail.com> * docs: add docstring to protocol method (delta-io#1660) * fix: percent encoding of partition values and paths * feat: handle path encoding in serde and encode partition values in file names * fix: always unquote partition values extracted from path * test: add tests for related issues * fix: consistent serialization of partition values * fix: roundtrip special characters * chore: format * fix: add feature requirement to load example * test: add timestamp col to partitioned roundtrip tests * test: add rust roundtrip test for special characters * fix: encode characters illegal on windows * docs: fix some typos (delta-io#1662) # Description Saw two typos and marking merge in rust as half-done with a comment on its current limitation. * feat: use url parsing from object store * fix: ensure config for ms fabric * chore: drive-by simplify test files * fix: update aws http config key * fix: feature gate azure update * feat: more robust azure config handling * fix: in memory store handling * feat: use object-store's s3 store if copy-if-not-exists headers are specified (delta-io#1356) * refactor: re-organize top level modules (delta-io#1434) # Description ~This contains changes from delta-io#1432, will rebase once that's merged.~ This PR constitutes the bulk of re-organising our top level modules. - move `DeltaTable*` structs into new `table` module - move table configuration into `table` module - move schema related modules into `schema` module - rename `action` module to `protocol` - hoping to isolate everything that can one day be the log kernel. ~It also removes the deprecated commit logic from `DeltaTable` and updates call sites and tests accordingly.~ I am planning one more follow up, where I hope to make `transactions` currently within `operations` a top level module.
While the number of touched files here is already massive, I want to do this in a follow up, as it will also include some updates to the transactions themselves, which should be more carefully reviewed. # Related Issue(s) closes: delta-io#1136 * chore: increment python library version (delta-io#1664) * fix exception string in writer.py The exception message is ambiguous as it interchanges the table and data schemas. * Update docs * add read me * Add space * feat: allow to set large dtypes for the schema check in `write_deltalake` (delta-io#1668) # Description Currently it was always checking the schema for non-large types. I didn't know before that we could change this, so in polars we added some schema casting from large to non-large; this however became a problem today when I wanted to write 200M records at once, because the array was too big to fit in a normal string type. ```python ArrowInvalid: Failed casting from large_string to string: input array too large ``` Adding this flag will allow libraries like polars to write directly with their large dtypes in arrow. If this is merged, I can work on a fix in polars to remove the schema casting for these large types. * fix: change partitioning schema from large to normal string for pyarrow<12 (delta-io#1671) # Description If pyarrow is below v12.0.0 it changes the partitioning schema fields from large_string to string. # Related Issue(s) closes delta-io#1669 # Documentation apache/arrow#34546 (comment) --------- Co-authored-by: Will Jones <willjones127@gmail.com> * chore: bump rust crate version * fix: use epoch instead of ce for date stats (delta-io#1672) # Description date32 statistics logic was subjectively wrong.
It was using `from_num_days_from_ce_opt` which > Makes a new NaiveDate from a day's number in the proleptic Gregorian calendar, with January 1, 1 being day 1. while date32 is commonly represented as days since UNIX epoch (1970-01-01) # Related Issue(s) closes delta-io#1670 # Documentation It doesn't seem like parquet actually has a spec for what a `date` should be, but many other tools seem to use the epoch logic. duckdb, and polars seem to use epoch instead of gregorian. Also arrow spec states that date32 should be epoch. for example, if i write using polars ```py import polars as pl # %% df = pl.DataFrame( { "a": [ 10561, 9200, 9201, 9202, 9203, 9204, 9205, 9206, 9207, 9208, 9199, ] } ) # %% df.select(pl.col("a").cast(pl.Date)).write_delta("./db/polars/") ``` the stats are correctly interpreted ``` {"add":{"path":"0-7b8f11ab-a259-4673-be06-9deedeec34ff-0.parquet","size":557,"partitionValues":{},"modificationTime":1695779554372,"dataChange":true,"stats":"{\"numRecords\": 11, \"minValues\": {\"a\": \"1995-03-10\"}, \"maxValues\": {\"a\": \"1998-12-01\"}, \"nullCount\": {\"a\": 0}}"}} ``` * chore: update changelog for the rust-v0.16.0 release * Remove redundant changelog entry for 0.16 * update readme * fix: update the delta-inspect CLI to be build again by Cargo This sort of withered on the vine a bit, this pull request allows it to be built properly again * update readme * chore: bump the version of the Rust crate * fix: unify environment variables referenced by Databricks docs Long-term fix will be for Databricks to release a Rust SDK for Unity 😄 Fixes delta-io#1627 * feat: support CREATE OR REPLACE * docs: get docs.rs configured correctly again (delta-io#1693) # Description The docs build was changed in delta-io#1658 to compile on docs.rs with all features, but our crate cannot compile with all-features due to the TLS features, which are mutually exclusive. 
# Related Issue(s) - closes delta-io#1692 This has been tested locally with the following command: ``` cargo doc --features azure,datafusion,gcs,glue,json,python,s3,unity-experimental ``` * fix!: ensure predicates are parsable (delta-io#1690) # Description Resolves two issues that impact Datafusion implemented operators 1. When a user has an expression with a built-in scalar function, we are unable to parse the output predicate since the `DummyContextProvider`'s methods are unimplemented. The provider now uses the user provided state or a default. More work is required in the future to allow a user provided Datafusion state to be used during the conflict checker. 2. The string representation was not parsable by sqlparser since it was not valid SQL. New code was written to transform an expression into a parsable sql string. The current implementation is not exhaustive; however, common use cases are covered. The delta_datafusion.rs file is getting large, so I transformed it into a module. This implementation makes reuse of some code from Datafusion. I've added the Apache License at the top of the file. Let me know if anything else is required to be compliant. # Related Issue(s) - closes delta-io#1625 --------- Co-authored-by: Will Jones <willjones127@gmail.com> * fix typo in readme * fix: address formatting errors * fix: remove an unused import * feat(python): expose delete operation (delta-io#1687) # Description Naively expose the delete operation, with the option to provide a predicate. I first tried to expose a richer API with the Python `FilterType` and DNF expressions, but from what I understand delta-rs doesn't implement generic filters but only `PartitionFilter`. The `DeleteBuilder` also only accepts datafusion expressions. So instead of hacking my way around or proposing a refactor, I went for the simpler approach of sending a string predicate to the rust lib. If this implementation is OK I will add tests.
# Related Issue(s) - closes delta-io#1417 --------- Co-authored-by: Will Jones <willjones127@gmail.com> * docs(python): document the delete operation * Introduce some redundant type definitions to the mypy stub * chore: fix new clippy lints introduced in Rust 1.73 * Update the sphinx ignore for building =_= * Enable prebuffer * implement issue 1169 * fix format * feat: add version number in `.history()` and display in reversed chronological order (delta-io#1710) # Description Adds the version number to each commit info. # Related Issue(s) <!--- For example: - closes delta-io#106 ---> - Closes delta-io#1561 - Closes delta-io#1680 --------- Co-authored-by: R. Tyler Croy <rtyler@brokenco.de> * feat(python): expose UPDATE operation (delta-io#1694) # Description - Exposes UPDATE operation to Python. - Added two test cases, with predicate and without - Took some learnings in simplifying the code (will apply it in MERGE PR as well) # Related Issue(s) <!--- For example: - closes delta-io#106 ---> Closes delta-io#1505 --------- Co-authored-by: Will Jones <willjones127@gmail.com> * fix: merge operation with string predicates (delta-io#1705) # Description Fixes an issue when users use string predicates with the merge operation. Parsing a string predicate did not properly handle table references and would always assume a bare table with a table name of the empty string. Now the qualifier is `None` however a `DFSchema` with qualifiers can be supplied where it makes sense. Now users must provide source and target aliases whenever both sides share a column name otherwise the operation will error out. Minor refactoring of the expression parser was also done and allowed using of case expressions. 
# Related Issue(s) - closes delta-io#1699 --------- Co-authored-by: Will Jones <willjones127@gmail.com> * refactor!: remove a layer of lifetimes from PartitionFilter (delta-io#1725) # Description This commit removes a bunch of lifetime restrictions on the `PartitionFilter` and `PartitionFilterValue` classes to make them easier to use. While the original discussion in Slack and delta-io#1501 made mention of using a reference type, there doesn't seem to be a need for it. A particular instance of a `PartitionFilter` is created once and just borrowed and read for the remainder of its life. Functions, when necessary, continue to accept the non-container types (i.e., `&str` and `&[&str]`), allowing their containerized counterparts (i.e., `String` and `Vec<String>`) to continue working with them without needing to borrow or clone the containers. # Related Issue(s) - resolves delta-io#1501 * feat(python): expose MERGE operation (delta-io#1685) # Description This exposes MERGE commands to the Python API. The updates and predicates are first kept in the `TableMerger` class and only dispatched to Rust after `TableMerger.execute()`. This was my first thought on how to implement it since I have limited experience with Rust and PyO3 (still learning 😄). Maybe a more elegant solution is that every class method on TableMerger is dispatched to Rust and then the Rust MergeBuilder gets serialized and sent back to Python (back and forth). Let me know your thoughts on this. If that is better, I could also do it in the next PR, so we can at least push this one out sooner. A couple of issues at the moment I need feedback on, where the first one is blocking since I can't test it now: ~- Source_alias is not applying; somehow during a schema check the prefix is missing, however when I printed the lines inside merge, it showed the prefix correctly.
So not sure where the issue is~ ~- I had to make datafusion_utils public since I needed to get the Expression Struct from it, is this the right way to do that? @Blajda~ Edit: I will pull @Blajda's changes delta-io#1705 once merged with develop: # Related Issue(s) <!--- For example: - closes delta-io#106 ---> closes delta-io#1357 * chore: remove deprecated functions * chore: bump the python package version (delta-io#1734) # Description The description of the main changes of your pull request # Related Issue(s) <!--- For example: - closes delta-io#106 ---> # Documentation <!--- Share links to useful documentation ---> * fix: reorder encode_partition_value() checks and add tests (delta-io#1733) # Description The `isinstance(val, datetime)` check was after `isinstance(val, date)` which meant that it was never found. I added a test for each encoding type. --------- Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com> * Relax `pyarrow` pin * fix: remove `pandas` pin (delta-io#1746) # Description Removes the `pandas` pin. # Related Issue(s) Resolves delta-io#1745 * docs: get docs.rs configured correctly again (delta-io#1693) # Description The docs build was changed in delta-io#1658 to compile on docs.rs with all features, but our crate cannot compile with all-features due to the TLS features, which are mutually exclusive. # Related Issue(s) For example: - closes delta-io#1692 This has been tested locally with the following command: ``` cargo doc --features azure,datafusion,datafusion,gcs,glue,json,python,s3,unity-experimental ``` * Make this a patch release to fix docs.rs * Remove the hdfs feature from the docsrs build * refactor!: update operations to use delta scan (delta-io#1639) # Description Recently implemented operations did not use `DeltaScan` it had some gaps. These gaps would make it harder switch towards logical plans which is required for merge. 
Gaps: - It was not possible to include file lineage in the result - The subset of files to be scanned is known ahead of time. Users had to reconstruct a parquet scan based on those files The PR introduces a `DeltaScanBuilder` that allows users to specify which files to use when constructing the scan, whether the scan should be enhanced to include additional metadata columns, and allows a projection to be specified. It also retains the previous functionality of pruning based on the provided filter when files to scan are not provided. `DeltaScanConfig` is also introduced, which allows users to deterministically obtain the names of any added metadata columns, or allows them to specify the name if required. The public interface for `find_files` has changed but functionality remains the same. A new table provider was introduced which accepts a `DeltaScanConfig`. This is required for future merge enhancements so unmodified files can be pruned prior to writes. --------- Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com> * chore: update datafusion (delta-io#1741) Updates arrow and datafusion dependencies to latest. * docs: convert docs to use mkdocs (delta-io#1731) # Description Completed the outstanding tasks in delta-io#1708 Also changed the theme from readthedocs to mkdocs - both are built-in but the latter looks sleeker # Related Issue(s) closes delta-io#1708 --------- Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com> Co-authored-by: R. Tyler Croy <rtyler@brokenco.de> * docs: dynamodb lock configuration (delta-io#1752) # Description I have added documentation in the API and also on the Python usage page regarding this configuration. Please let me know if it is satisfactory, and if not, I am more than happy to address any issues or make any necessary adjustments.
# Related Issue(s) - closes delta-io#1674 * feat: ignore binary columns for stats generation * feat: honor appendOnly table config (delta-io#1747) # Description Throw an error if a transaction includes a Remove action with data change but the Delta Table is append-only. # Related Issue(s) - closes delta-io#352 * chore: fix building/running tests without the datafusion feature This looks like an oversight that our CI didn't catch because we typically have the datafusion feature enabled for our tests. The build error would only show up when building tests without it. * add write support explicitly for pyarrow dataset * feat(python): expose FSCK (repair) operation (delta-io#1730) # Description This PR exposes the FSCK operation as a `repair` method under the `DeltaTable` class. # Related Issue(s) - closes delta-io#1727 --------- Co-authored-by: Will Jones <willjones127@gmail.com> * refactor: perform bulk deletes during metadata cleanup In addition to doing bulk deletes, I removed what seems like (at least to me) unnecessary code. At its core, files are considered up for deletion when their last_modified time is older than the cutoff time AND the version is less than the specified version (usually the latest version).
* Make an attempt at improving the utilization of delete_stream for cleaning up expired logs This change builds on @cmackenzie1's work and feeds the list stream directly into the delete_stream with a predicate function to identify paths for deletion * start to add vacuum into transaction log * add vacuum operations in transaction log * attempt to calculate size * add test * chore: bump Python package version * fix: ignore inf in stats * doc(README): remove typo * enhance docs to enable multi-lingual examples * use official Python API for references * chore: refactor into the deltalake meta crate and deltalake-core crates This puts the groundwork in place for starting to partition into smaller crates in a simpler and more manageable fashion. See delta-io#1713 * Correct the working directory for the parquet2 tests * feat: add deltalake sql crate (delta-io#1757) # Description This is an fairly early draft to create logical plans from sql using the datafusion abstractions. Adopted the patterns over there quite closely since the ultimate goal would be to ask the datafusion community if they would accept these changes within the datafusion sql crate ... # Related Issue(s) <!--- For example: - closes delta-io#106 ---> # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: R. Tyler Croy <rtyler@brokenco.de> * rollback resolve bucket region change --------- Co-authored-by: Robert Pack <robstar.pack@gmail.com> Co-authored-by: Robert Pack <42610831+roeap@users.noreply.github.com> Co-authored-by: nohajc <nohajc@gmail.com> Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: R. Tyler Croy <rtyler@brokenco.de> Co-authored-by: Denny Lee <denny.g.lee@gmail.com> Co-authored-by: QP Hou <dave2008713@gmail.com> Co-authored-by: haruband <haruband@gmail.com> Co-authored-by: Ben Magee <ben@bmagee.com> Co-authored-by: Constantin S. 
Pan <kvapen@gmail.com> Co-authored-by: Eero Lihavainen <eero.lihavainen@nitor.com> Co-authored-by: Mohammed Muddassir <v-mmuddassir@microsoft.com> Co-authored-by: Christopher Watford <christopher.watford@kcftech.com> Co-authored-by: Simon Vandel Sillesen <simon.vandel@gmail.com> Co-authored-by: Ion Koutsouris <ioncjk@gmail.com> Co-authored-by: Matthew Powers <matthewkevinpowers@gmail.com> Co-authored-by: Sébastien Diemer <diemersebastien@yahoo.fr> Co-authored-by: Cory Grinstead <universalmind.candy@gmail.com> Co-authored-by: Trinity Xia <trinityx@trinityacstudio.lan> Co-authored-by: hnaoto <hnaoto@me.com> Co-authored-by: universalmind303 <cory.grinstead@gmail.com> Co-authored-by: David Blajda <db@davidblajda.com> Co-authored-by: Josiah Parry <josiah.parry@gmail.com> Co-authored-by: Guilhem de Viry <gdeviry@mytraffic.fr> Co-authored-by: Nikolay Ulmasov <ulmasov@hotmail.com> Co-authored-by: Cole Mackenzie <cole@cloudflare.com> Co-authored-by: ldacey <lance.dacey@gmail.com> Co-authored-by: Dave Hirschfeld <dave.hirschfeld@gmail.com> Co-authored-by: David Blajda <blajda@hotmail.com> Co-authored-by: Brayan Jules <brayanjuls@users.noreply.github.com> Co-authored-by: emcake <3726783+emcake@users.noreply.github.com> Co-authored-by: Junjun Dong <junjun.dong9@gmail.com> Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com> Co-authored-by: Deep145757 <146447579+Deep145757@users.noreply.github.com>
wjones127
added a commit
that referenced
this issue
Nov 4, 2023
…date()` (#1749) # Description A user can now add a new_values dictionary that contains python objects as a value. Some weird behavior's I noticed, probably related to datafusion, updating a timestamp column has to be done by providing a unix timestamp in microseconds. I personally find this very confusing, I was expecting to be able to pass "2012-10-01" for example in the updates. Another weird behaviour is with list of string columns. I can pass `{"list_of_string_col":"[1,2,3]"}` or `{"list_of_string_col":"['1','2','3']"}` and both will work. I expect the first one to raise an exception on invalid datatypes. Combined datatypes `"[1,2,'3']"` luckily do raise an error by datafusion. # Related Issue(s) <!--- For example: - closes #106 ---> - closes #1740 --------- Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127
added a commit
that referenced
this issue
Nov 4, 2023
# Description You can now also do multiple when clauses just like in Rust and PySpark. I added one test for now 😄, will add more later when I have some time. I'll update the docs in another PR to reflect the possibility of this behavior. # Related Issue(s) <!--- For example: - closes #106 ---> - closes #1736 --------- Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127
added a commit
that referenced
this issue
Nov 5, 2023
# Description I build on top of the branch of @wjones127 #1602. In pyarrow v13+ the ParquetWriter by default uses the `compliant_nested_types = True` (see related PR: https://github.com/apache/arrow/pull/35146/files)and the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html). In arrow/parquet-rs it fails when it compares schemas because it expected the old non-compliant ones. For now we can have pyarrow 13+ supported by disabling it or updating the file options provided by a user. # Related Issue(s) <!--- For example: - closes #106 ---> - Closes #1744 # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
wjones127
added a commit
that referenced
this issue
Nov 6, 2023
# Description Yet again, one of the linux release builds broke. # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
roeap
added a commit
that referenced
this issue
Nov 6, 2023
# Description ~~this PR depends on #1741.~~ Migrating the implementation of actions and schema over from kernel. The schema is much more complete in terms of the more recent delta features and more rigorously leverages the rust type system. # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
roeap
pushed a commit
that referenced
this issue
Nov 12, 2023
ion-elgreco
pushed a commit
that referenced
this issue
Dec 2, 2023
# Description Prepare for next release # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
Jan-Schweizer
pushed a commit
to Jan-Schweizer/delta-rs
that referenced
this issue
Dec 2, 2023
# Description Prepare for next release # Related Issue(s) <!--- For example: - closes delta-io#106 ---> # Documentation <!--- Share links to useful documentation --->
ion-elgreco
added a commit
that referenced
this issue
Dec 5, 2023
# Description Latest Python release had a bunch of failures: https://github.com/delta-io/delta-rs/actions/runs/7095801050 Also doing some general cleanup. TODO: * [x] Figure out why Linux job failed # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
ion-elgreco
added a commit
that referenced
this issue
Jan 16, 2024
…s, to make sure tombstone and file paths match (#2035) # Description Percent-encoded file paths of Remove actions were not properly deserialized, and when compared to active file paths, the paths didn't match, which caused tombstones to be recognized as active files (be kept in the state) # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: Igor Borodin <igborodi@microsoft.com> Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com> Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
natinimni
pushed a commit
to natinimni/delta-rs
that referenced
this issue
Jan 31, 2024
…s, to make sure tombstone and file paths match (delta-io#2035) Percent-encoded file paths of Remove actions were not properly deserialized, and when compared to active file paths, the paths didn't match, which caused tombstones to be recognized as active files (be kept in the state) <!--- For example: - closes delta-io#106 ---> <!--- Share links to useful documentation ---> --------- Co-authored-by: Igor Borodin <igborodi@microsoft.com> Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com> Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
RobinLin666
pushed a commit
to RobinLin666/delta-rs
that referenced
this issue
Feb 2, 2024
…s, to make sure tombstone and file paths match (delta-io#2035) # Description Percent-encoded file paths of Remove actions were not properly deserialized, and when compared to active file paths, the paths didn't match, which caused tombstones to be recognized as active files (be kept in the state) # Related Issue(s) <!--- For example: - closes delta-io#106 ---> # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: Igor Borodin <igborodi@microsoft.com> Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com> Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
ion-elgreco
pushed a commit
that referenced
this issue
Mar 4, 2024
# Description As requested by @ion-elgreco in #2229 , we should fix the formatter versions # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
ion-elgreco
added a commit
that referenced
this issue
Mar 21, 2024
# Description The description of the main changes of your pull request # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
ion-elgreco
pushed a commit
that referenced
this issue
Apr 1, 2024
# Description The description of the main changes of your pull request # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
ion-elgreco
added a commit
that referenced
this issue
Apr 22, 2024
# Description The description of the main changes of your pull request # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
ion-elgreco
pushed a commit
that referenced
this issue
Jun 11, 2024
# Description Updates the arrow and datafusion dependencies to 52 and 39(-rc1) respectively. This is necessary for updating pyo3. While most changes with trivial, some required big rewrites. Namely, the logic for the Updates operation had to be rewritten (and simplified) to accommodate some new sanity checks inside datafusion: (apache/datafusion#10088). Depends on delta-kernel having its arrow and object-store version bumped as well. This PR doesn't include any major changes for pyo3, I'll open a separate PR depending on this PR. # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
ion-elgreco
pushed a commit
that referenced
this issue
Jun 14, 2024
# Description This migrates the Python package to use the new pyo3 bounds-based API, which allows more control over memory management on the library side and theoretical performance improvements (I benchmarked, and didn't notice anything substantial). The old API will be removed in 0.22. # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
ion-elgreco
pushed a commit
that referenced
this issue
Jun 19, 2024
# Description Object stores expected fixed lengths for all multipart upload parts right up until the last part. The original logic just flushed when it exceeded the threshold. Now, it flushes when the threshold is met exclusively with the same fixed buffer, unless we're completing the transaction, in which case the last piece is allowed to be smaller. Bumps the constant to reflect that the minimum expected size by most object stores is 5MiB. Also adds a UserWarning if a constant is specified to be less. Also releases the GIL in more places by moving the flushing logic to a free function. # Related Issue(s) <!--- For example: - closes #106 ---> Closes #2605 # Documentation <!--- Share links to useful documentation ---> See: [MultipartUpload](https://docs.rs/object_store/latest/object_store/trait.MultipartUpload.html) docs
ion-elgreco
pushed a commit
that referenced
this issue
Jun 21, 2024
# Description Add support for HDFS using [hdfs-native](https://github.com/Kimahriman/hdfs-native), a pure* Rust client for interacting with HDFS. Creates a new `hdfs` sub-crate, adds it as a feature to `deltalake` meta crate, and includes it in Python wheels by default. There is a Rust integration test that requires Hadoop and Java to be installed, and makes use of a small Maven program I ship under the `integration-test` feature flag to run a MiniDFS server. *Dynamically loads `libgssapi_krb5` using `libloading` for Kerberos support # Related Issue(s) <!--- For example: - closes #106 ---> Resolves #2611 # Documentation <!--- Share links to useful documentation --->
ion-elgreco
added a commit
that referenced
this issue
Jun 24, 2024
# Description The description of the main changes of your pull request # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
rtyler
pushed a commit
that referenced
this issue
Jul 18, 2024
…ipeline (#2679) Currently, `ruff` and `mypy` have their latest versions installed in the CI pipeline, while locally they are fixed to a specific version. This can cause issues, see #2678. This PR proposes to fix them to their specific version in the pipeline. The alternative I could think of was installing the virtual environment with `make develop`, but that takes between 4 and 5 minutes, which might be considered a bit too long to wait on linting results. This PR will have conflicts with #2674, so I'll need to rebase one of these PR's once the other is merged. # Related Issue(s) - closes [#106](#2678)
ion-elgreco
pushed a commit
that referenced
this issue
Jul 20, 2024
…peline (#2687) # Description The CI/CD pipeline currently contains some duplication; This PR proposes to simplify that a bit by creating a reusable action to set up Python and Rust. # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
ion-elgreco
pushed a commit
that referenced
this issue
Jul 20, 2024
# Description This PR proposes to add a `make test-cov` command to the `Makefile` to make it easier for contributors to check test coverage. Output currently looks as follows: ``` [...] tests/test_writerproperties.py::test_write_with_writerproperties PASSED ---------- coverage: platform darwin, python 3.11.2-final-0 ---------- Name Stmts Miss Branch BrPart Cover ------------------------------------------------------------- deltalake/__init__.py 11 0 0 0 100% deltalake/_util.py 16 1 12 1 93% deltalake/data_catalog.py 6 0 0 0 100% deltalake/exceptions.py 5 0 0 0 100% deltalake/fs.py 44 11 2 0 76% deltalake/schema.py 60 0 22 0 100% deltalake/table.py 431 42 174 23 89% deltalake/writer.py 267 98 175 14 58% ------------------------------------------------------------- TOTAL 840 152 385 38 78% Coverage HTML written to dir htmlcov ``` Also removed `--cov=deltalake` from the `pytest` ini commands, because I think that is better moved to `[tool.coverage.run]`. # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
ion-elgreco
pushed a commit
that referenced
this issue
Jul 21, 2024
ion-elgreco
pushed a commit
that referenced
this issue
Jul 21, 2024
# Description The codebase contains the following code twice: ``` if sys.version_info >= (3, 8): from typing import Literal else: from typing_extensions import Literal ``` I believe this can be removed, since [pyproject.toml](https://github.com/delta-io/delta-rs/blob/f432c4f8337c2b0d47958645684e5df336c61522/python/pyproject.toml#L10) specifies that the minimum Python version for the project is 3.8: ```toml requires-python = ">=3.8" ``` # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
JSON delta log entries should periodically be consolidated into a parquet snapshot so that log readers can deserialize the table state quickly. The Spark delta reference implementation consolidates JSON log entries into a parquet checkpoint snapshot after every 10th JSON commit. This behavior is also described under the Checkpoints section of the delta protocol. delta-rs should offer a similar utility function for creating parquet checkpoint snapshots. An MVP implementation should at minimum expose a parquet checkpoint utility function so that callers can schedule checkpoint creation as appropriate.
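To make the expected cadence concrete, here is a minimal Rust sketch of the decision and naming logic a checkpoint utility would need. The function names (`should_checkpoint`, `checkpoint_path`) and the fixed interval of 10 are illustrative assumptions, not delta-rs API; the actual utility would also need to serialize the reconciled actions to parquet and write a `_last_checkpoint` file, which is omitted here.

```rust
/// Checkpoint after every Nth commit, mirroring the reference
/// implementation's default cadence (an assumed constant, not delta-rs API).
const CHECKPOINT_INTERVAL: i64 = 10;

/// Returns true when a parquet checkpoint should be created after
/// committing `version` (every 10th commit, skipping version 0).
fn should_checkpoint(version: i64) -> bool {
    version > 0 && version % CHECKPOINT_INTERVAL == 0
}

/// Checkpoint files live next to the JSON commits in `_delta_log/`,
/// with the version zero-padded to 20 digits per the protocol's
/// file-naming scheme.
fn checkpoint_path(version: i64) -> String {
    format!("_delta_log/{:020}.checkpoint.parquet", version)
}

fn main() {
    assert!(!should_checkpoint(7));
    assert!(should_checkpoint(10));
    assert_eq!(
        checkpoint_path(10),
        "_delta_log/00000000000000000010.checkpoint.parquet"
    );
}
```

A caller could invoke `should_checkpoint` after each successful commit and, when it returns true, run the (future) checkpoint-writing routine against the table state at that version.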