Merged
Changes from 4 commits
@@ -32,7 +32,7 @@
"target_file_size_bytes": 134217728,
"compaction_strategy": "bin-pack",
"max-concurrent-file-group-rewrites": 5,
-"my-key": "my-value"
+"key1": "value1"
}
}
]
@@ -0,0 +1,37 @@
{
"license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)",
"$id": "https://polaris.apache.org/schemas/policies/system/metadata-compaction/2025-02-03.json",
Contributor:
Does the JSON magically land at the location specified in the ID somehow? Or do we always need a followup PR?

Also, it looks a little funny to use dates here given that the date in the PR may not align with the date the schema actually becomes effective. In the worst case, we could merge two versions in one day. Maybe just an incrementing number is easier?

Contributor Author:
Unfortunately, no. I hope we can publish them once and for all based on the directory structure.

I'd like to keep the same date for all of them since they are the first batch; that should be fine because nobody is using them yet. Once we release them in 1.0, we should follow the date scheme strictly.

"title": "Metadata Compaction Policy",
"description": "Inheritable Polaris policy schema for Iceberg table metadata compaction.",
"type": "object",
"properties": {
"version": {
"type": "string",
"const": "2025-02-03",
"description": "Schema version."
},
"enable": {
"type": "boolean",
"description": "Enable or disable metadata compaction."
},
"config": {
"type": "object",
"description": "A map containing custom configuration properties. Please note that interoperability is not guaranteed.",
"additionalProperties": {
"type": ["string", "number", "boolean"]
}
}
},
"required": ["enable"],
"additionalProperties": false,
"examples": [
{
"version": "2025-02-03",
"enable": true,
"config": {
"spec_id": 1,
"key1": "value1"
}
}
]
}
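To make concrete what the schema above accepts and rejects, here is a minimal hand-written Python sketch of its constraints: a required boolean `enable`, an optional `version` pinned to `2025-02-03`, an optional `config` map of scalar values, and no other top-level keys. The function name `validate_metadata_compaction` is hypothetical and not part of Polaris; in practice a JSON Schema validator library would more likely be used.

```python
from typing import Any

SCHEMA_VERSION = "2025-02-03"
ALLOWED_KEYS = {"version", "enable", "config"}


def validate_metadata_compaction(policy: Any) -> list[str]:
    """Return a list of violations; an empty list means the policy is valid.

    Hypothetical helper mirroring the schema by hand, for illustration only.
    """
    if not isinstance(policy, dict):
        return ["policy must be a JSON object"]
    errors = []
    # "additionalProperties": false at the top level
    for key in sorted(set(policy) - ALLOWED_KEYS):
        errors.append(f"unexpected property: {key}")
    # "required": ["enable"] with "type": "boolean"
    if "enable" not in policy:
        errors.append("missing required property: enable")
    elif not isinstance(policy["enable"], bool):
        errors.append("enable must be a boolean")
    # "version" is optional but, if present, must equal the const
    if "version" in policy and policy["version"] != SCHEMA_VERSION:
        errors.append(f"version must be {SCHEMA_VERSION}")
    # "config" values may be string, number, or boolean
    config = policy.get("config", {})
    if not isinstance(config, dict):
        errors.append("config must be an object")
    else:
        for key, value in config.items():
            if not isinstance(value, (str, int, float, bool)):
                errors.append(f"config.{key} must be a string, number, or boolean")
    return errors
```

Calling it on the schema's own example should return an empty list, while a policy without `enable` or with an unknown top-level key should return one violation each.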
@@ -0,0 +1,47 @@
{
"license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)",
"$id": "https://polaris.apache.org/schemas/policies/system/orphan-file-removal/2025-02-03.json",
"title": "Orphan File Removal Policy",
"description": "Inheritable Polaris policy schema for Iceberg table orphan file removal.",
Contributor:
  1. What does Inheritable mean here?
  2. Polaris seems redundant, all of these are going to be Polaris schemas right?

Contributor Author:
An inheritable policy is one that also applies to the entities underneath the one it is assigned to. For example, if a policy is assigned to a namespace, all tables under that namespace get it.
I'm OK with either removing or keeping it. Keeping it gives a complete picture to anyone who reads the schema without much Polaris context.

Contributor:
Are all policies inheritable?

Contributor Author:
No. For example, column masking policies are not inheritable, which makes more sense and is also what most engines do.

"type": "object",
"properties": {
"version": {
"type": "string",
"const": "2025-02-03",
"description": "Schema version."
},
"enable": {
"type": "boolean",
"description": "Enable or disable orphan file removal."
},
"max_orphan_file_age_in_days": {
Member:
@flyrain I'm also struggling with this a bit: an orphan file removal policy can be expressed in terms of more than just file age. Why don't we opt for the config key, as in the other policies?

Contributor Author:
> an orphan file removal policy can be expressed in terms of more than just file age.

Can you name them? We can put them into the schema if they are commonly used; otherwise, the config map is the best place for them.

Member:
I'd vote for the config map

Contributor Author:
If no extra field is suggested, we could keep it as is.

"type": "number",
"description": "Specifies the maximum age (in days) for orphaned files before they are eligible for removal."
},
"location": {

What do you think about making this property multi-value? This way, a user could support adding paths for multiple "namespaces".

Contributor:
I guess you don't mean namespaces but just "locations"?

I think having one policy map to one location is probably okay for now; I'm not sure if/how we plan to handle overlapping locations though.

Contributor Author:
We could support that. One use case is that table files might be stored in different locations based on the write.data.path and/or write.metadata.path settings. This is generally not recommended, though, because it causes issues such as making credential vending harder. Are there any other use cases you have in mind, @ashvina?

Contributor Author:
Overlapping locations are very dangerous. No orphan file removal should happen in that case.

Contributor Author:
Made it a string array type in the new commit.

"type": "string",
"description": "Specifies a custom directory to search for files instead of the default table location. Use with caution—if set to a broad location (e.g., s3://my-bucket instead of s3://my-bucket/my-table-location), all unreferenced files in that path may be permanently deleted, including files from other tables. Following best practices, tables should be stored in separate locations to avoid accidental data loss."
Contributor Author:
This is hard to read in an IDE like IntelliJ because it is a single long line. JSON doesn't support breaking a string across multiple lines. This makes me think we might use YAML instead of JSON.

Contributor:
The alternative is just to put line breaks in the string, and then preprocess it anywhere we want to strip out the whitespace

Contributor Author (@flyrain, Feb 14, 2025):
> put line breaks in the string

One of my versions did that :). It didn't help much, especially in the IDE.

},
"config": {
"type": "object",
"description": "A map containing custom configuration properties. Note that interoperability is not guaranteed.",
"additionalProperties": {
"type": ["string", "number", "boolean"]
}
}
},
"required": ["enable"],
"additionalProperties": false,
"examples": [
{
"version": "2025-02-03",
"enable": true,
"max_orphan_file_age_in_days": 30,
"location": "s3://my-bucket/my-table-location",
"config": {
"prefix_mismatch_mode": "ignore",
"key1": "value1"
}
}
]
}
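The review thread on `location` notes that overlapping locations are dangerous and that no orphan file removal should happen in that case. A rough Python sketch of such a guard follows; it treats two locations as overlapping when they are equal or when one is a path prefix of the other. The helper names `locations_overlap` and `find_overlaps` are hypothetical, and the comparison is purely textual: it assumes locations are already normalized and does not resolve scheme aliases or other equivalent spellings.

```python
from itertools import combinations


def _normalize(location: str) -> str:
    # Force exactly one trailing slash so "s3://b/t1" is not treated
    # as a prefix of "s3://b/t10".
    return location.rstrip("/") + "/"


def locations_overlap(a: str, b: str) -> bool:
    """True if the two locations are equal or one contains the other."""
    na, nb = _normalize(a), _normalize(b)
    return na.startswith(nb) or nb.startswith(na)


def find_overlaps(locations: list[str]) -> list[tuple[str, str]]:
    """Return every pair of overlapping locations, in input order."""
    return [(a, b) for a, b in combinations(locations, 2) if locations_overlap(a, b)]
```

A caller could refuse to run orphan file removal whenever `find_overlaps` returns a non-empty list, which matches the cautious stance taken in the thread. For instance, `s3://my-bucket` and `s3://my-bucket/my-table-location` overlap under this check.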
@@ -0,0 +1,39 @@
{
"license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)",
"$id": "https://polaris.apache.org/schemas/policies/system/snapshot-retention/2025-02-03.json",
"title": "Snapshot Retention Policy",
"description": "Inheritable Polaris policy schema for Iceberg table snapshot retention.",
"type": "object",
"properties": {
"version": {
"type": "string",
"const": "2025-02-03",
"description": "Schema version."
},
"enable": {
"type": "boolean",
"description": "Enable or disable snapshot retention."
},
"config": {
"type": "object",
"description": "A map containing custom configuration properties. Please note that interoperability is not guaranteed.",
"additionalProperties": {
"type": ["string", "number", "boolean"]
}
}
},
"required": ["enable"],
"additionalProperties": false,
"examples": [
{
"version": "2025-02-03",
"enable": true,
"config": {
"min_snapshot_to_keep": 1,
"max_snapshot_age_days": 2,
"max_ref_age_days": 3,
"key1": "value1"
}
}
]
}
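The review discussion describes an inheritable policy as one that also applies to the entities below the one it is assigned to, for example all tables under a namespace. A hedged Python sketch of that lookup: walk from the entity up through its ancestors and return the first policy found. The entity paths, the `assignments` mapping, and the nearest-ancestor-wins rule are all assumptions made for illustration; the thread does not specify how Polaris resolves policies assigned at several levels.

```python
from typing import Optional


def effective_policy(entity: tuple[str, ...],
                     assignments: dict[tuple[str, ...], dict]) -> Optional[dict]:
    """Return the policy assigned to `entity` or to its nearest ancestor.

    `entity` is a path such as ("catalog", "ns", "table"); `assignments`
    maps entity paths to policy documents. Both shapes are hypothetical.
    """
    # Try the entity itself first, then progressively shorter prefixes
    # (its namespace, then its catalog).
    for depth in range(len(entity), 0, -1):
        policy = assignments.get(entity[:depth])
        if policy is not None:
            return policy
    return None
```

With a policy assigned only to `("cat", "ns")`, a table `("cat", "ns", "t1")` resolves to the namespace policy, while a table that has its own assignment resolves to that one instead.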