Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
109 changes: 67 additions & 42 deletions docs/src/partitioning-spec.md
Original file line number Diff line number Diff line change
Expand Up @@ -211,16 +211,14 @@ The column name is `partition_field_{i}` where `{i}` is the partition field's `f
This naming convention avoids potential conflicts with user-defined column names.
When a new partition spec version is defined, the `__manifest` table schema is updated accordingly to include any new partition columns.

| Column | Type | Description |
|------------------------------|----------|-----------------------------------------------------------------------------|
| `object_id` | `string` | Full namespace path with `$` separator (existing) |
| `object_type` | `string` | `"namespace"` or `"table"` (existing) |
| `metadata` | `string` | JSON-encoded metadata/properties (existing) |
| `read_version` | `uint64` | Table version for reads (optional, see [Transaction](#transaction)) |
| `read_branch` | `string` | Table branch for reads (optional, see [Transaction](#transaction)) |
| `read_tag` | `string` | Table tag for reads (optional, see [Transaction](#transaction)) |
| `partition_field_{field_id}` | `<type>` | Partition value for the field (nullable, inherited from parent namespaces) |
| ... | ... | Additional partition field columns as needed |
| Column | Type | Description |
|------------------------------|----------|----------------------------------------------------------------------------|
| `object_id` | `string` | Full namespace path with `$` separator (existing) |
| `object_type` | `string` | `"namespace"` or `"table"` (existing) |
| `metadata` | `string` | JSON-encoded metadata/properties (existing) |
| `read_version` | `uint64` | Table version for reads (optional, see [Transaction](#transaction)) |
| `partition_field_{field_id}` | `<type>` | Partition value for the field (nullable, inherited from parent namespaces) |
| ... | ... | Additional partition field columns as needed |

Partition values are inherited from parent namespaces - each row has all partition values from its ancestors.
See [Appendix C: Manifest Table Example](#appendix-c-manifest-table-example) for a complete example.
Expand Down Expand Up @@ -297,43 +295,67 @@ This design ensures backward compatibility while enabling partition strategy evo
Operations within a single partition table are ACID-compliant according to the Lance table specification.
Each partition is an independent Lance table, so reads and writes to a single partition follow standard Lance transaction semantics.

### Multi-Partition Transaction
### Multi-Partition Transaction (Weak)

By default, operations across multiple partitions have weaker guarantees:

- **Writes across partitions are not atomic or consistent**: A write that affects multiple partitions may partially succeed, leaving some partitions updated while others are not.
- **Reads across partitions are not isolated**: A read spanning multiple partitions may observe different versions of each partition, leading to inconsistent views.

To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include `read_version`, `read_branch`, and `read_tag` columns for a table.
These columns record which version of each partition table to read.
In multi-partition transaction weak mode, write operations are directly committed to the main branch of each table. It means users can always see the fresh state of each leaf
partition table without additional information from partitioned namespace.

#### Read Behavior
Users need to handle writes across partitions carefully because there is no ACID guarantees. One way is to use idempotent write like `insert overwrite` then retry
for whatever error. Another way is writing partitions one by one.

**Summary**
* Read Behavior: Readers should always read the latest version from the main branch.

* Write Behavior: Writers should always commit to the main branch.

* Conflict Resolution: No conflict resolution for cross-partition writes.

### Multi-Partition Transaction (Strong)

To enable stronger transactional guarantees across partitions, the `__manifest` table can optionally include `read_version` column for a table.
The `read_version` records which version of each partition table to read.

Users should specify one of the following combinations:
In strong mode, write operations will use detached commit to each table. A detached commit is invisible unless
the version is set, it makes sure the intermediate state of a transaction remains invisible. Users need to first get `read_version` from partitioned
namespace, then read the leaf partition table using the current version.

1. **`read_version` only**: Read the specified version from the main branch.
2. **`read_branch` + `read_version`**: Read the specified version from the specified branch.
3. **`read_tag` only**: Read the version referenced by the specified tag.
The `__manifest` table records all the versions of each partitioned table. A version of a partitioned table is recorded
as an entry, with `object_id` in format `table_id${version}` and `object_type` value `table_version`.

When all columns are NULL or not present, readers should read the latest version from the main branch.
Partitioned namespace uses `merge_insert` to commit a transaction, which makes sure the commit is atomic.

#### Read Behavior

1. Transaction starts, recording the current version of `__manifest` table as snapshot id(S0).
2. Search `__manifest` table to collect current `read_version`s for the partition tables to read.
3. Read the current version from the partition table.

#### Commit Behavior

Multi-partition transactions are guarded by commits against the `__manifest` table. A typical multi-partition write follows this pattern:

1. Write data to each affected partition table independently
2. Atomically update the `read_version` (and optionally `read_branch` or `read_tag`) of all affected partitions in a single `__manifest` commit
1. Transaction starts, recording the current version of `__manifest` table as snapshot id(S0).
2. Write data to each affected partition table independently
3. Get current version of `__manifest` table as snapshot id(S1), detect/resolve conflicts if `S1` is not `S0`.
4. Atomically update the `read_version` from `S1` to `S2` in a single `__manifest` commit. This should be done by `merge_insert`.

This ensures all-or-nothing visibility of changes across partitions.

#### Conflict Resolution

If concurrent commits have been committed to `__manifest` since the transaction began, the implementation must either:

1. Rebase the current commit onto the latest `__manifest` version and retry the commit, or
2. Fail the current commit and return an error to the caller

Implementations are responsible for ensuring the appropriate conflict detection and resolution strategy to guarantee ACID semantics during multi-partition transactions.
1. Based on `read_version` timeline to fetch the changes between S0 and S1.
2. For each updated table in S2:
* collect the transactions between S0 to S1;
* detect/resolve conflicts between the collected transactions and S2;
3. Commit the partitioned namespace if all conflicts are resolved.
4. Otherwise
* Rebase the current commit onto the latest `__manifest` version and retry the commit, or
* Fail the current commit and return an error to the caller.

## Appendices

Expand Down Expand Up @@ -431,15 +453,18 @@ The namespaces (`v1`, `v1$k7m2n9p4q8r5s3t6`, etc.) are tracked in the `__manifes

The `__manifest` table for a partitioned namespace with partition fields `event_date` (v1), `event_year` (v2) and `country` (v2), showing entries from both spec versions:

| object_id | object_type | metadata | read_version | read_branch | read_tag | partition_field_event_date | partition_field_event_year | partition_field_country |
|-----------------------------------------------|-------------|----------|--------------|-------------|----------|----------------------------|----------------------------|-------------------------|
| v1 | namespace | {} | NULL | NULL | NULL | NULL | NULL | NULL |
| v1$k7m2n9p4q8r5s3t6 | namespace | {} | NULL | NULL | NULL | 2025-12-10 | NULL | NULL |
| v1$k7m2n9p4q8r5s3t6$dataset | table | {} | 5 | NULL | NULL | 2025-12-10 | NULL | NULL |
| v2 | namespace | {} | NULL | NULL | NULL | NULL | NULL | NULL |
| v2$e9f0g1h2i3j4k5l6 | namespace | {} | NULL | NULL | NULL | NULL | 2025 | NULL |
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4 | namespace | {} | NULL | NULL | NULL | NULL | 2025 | US |
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | table | {} | 3 | NULL | NULL | NULL | 2025 | US |
| object_id | object_type | metadata | read_version | partition_field_event_date | partition_field_event_year | partition_field_country |
|------------------------------------------------|---------------|----------|----------------------|----------------------------|----------------------------|-------------------------|
| v1 | namespace | {} | NULL | NULL | NULL | NULL |
| v1$k7m2n9p4q8r5s3t6 | namespace | {} | NULL | 2025-12-10 | NULL | NULL |
| v1$k7m2n9p4q8r5s3t6$dataset | table | {} | NULL | 2025-12-10 | NULL | NULL |
| v1$k7m2n9p4q8r5s3t6$dataset$1 | table_version | {} | 1 | NULL | NULL | NULL |
| v1$k7m2n9p4q8r5s3t6$dataset$2 | table_version | {} | 11120734598765432152 | NULL | NULL | NULL |
| v2 | namespace | {} | NULL | NULL | NULL | NULL |
| v2$e9f0g1h2i3j4k5l6 | namespace | {} | NULL | NULL | 2025 | NULL |
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4 | namespace | {} | NULL | NULL | 2025 | US |
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | table | {} | NULL | NULL | 2025 | US |
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset$1 | table_version | {} | 1 | NULL | NULL | NULL |

Note: The root namespace properties (`partition_spec_v1`, `partition_spec_v2`, `schema`) are stored in the `__manifest` table's metadata, not as a row. The `object_id` uses `$` as the namespace path separator. Partition columns use the naming convention `partition_field_{field_id}` where `{field_id}` is the partition field's string identifier. Partition values are inherited from parent namespaces. When retrieving properties via API, partition values are converted to `partition.<field_id> = <value>` entries.

Expand All @@ -459,7 +484,7 @@ WHERE event_date = '2025-12-10' AND country = 'US'
The engine translates this to the following `__manifest` DataFusion query plan to examine related partition tables.

```sql
SELECT object_id, location, read_version, read_branch, read_tag
SELECT object_id, location, read_version
FROM __manifest
WHERE object_type = 'table'
AND (
Expand All @@ -480,14 +505,14 @@ One example way to perform such substitution is:

This query returns:

| object_id | location | read_version | read_branch | read_tag |
|----------------------------------------------|-------------------------------------------------------|--------------|-------------|----------|
| v1$k7m2n9p4q8r5s3t6$dataset | b4a3c2d1_v1$k7m2n9p4q8r5s3t6$dataset | 5 | NULL | NULL |
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | aabbccdd_v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | 3 | NULL | NULL |
| object_id | location | read_version |
|----------------------------------------------|-------------------------------------------------------|----------------------|
| v1$k7m2n9p4q8r5s3t6$dataset | b4a3c2d1_v1$k7m2n9p4q8r5s3t6$dataset | 18446744073709551615 |
| v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | aabbccdd_v2$e9f0g1h2i3j4k5l6$m7n8o9p0q1r2s3t4$dataset | 16045690984833335022 |

- For partition spec v1, the `country = 'US'` filter cannot be pushed to partition pruning (v1 has no `country` partition), so it must be applied during the table scan
- For partition spec v2, both filters are pushed down: `partition_field_event_year = 2025` (computed from `year(event_date)`) and `partition_field_country = 'US'`
- The engine reads each table at the version specified by `read_version`, `read_branch`, or `read_tag` for consistent snapshot reads
- The engine reads each table at the version specified by `read_version` for consistent snapshot reads

### Appendix E: Runtime Namespace Properties Example

Expand Down