-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement IcebergTableProviderFactory for datafusion #600
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yukkit for this pr! Generally looks good to me, I've left minor points to improve.
if !table_partition_cols.is_empty() { | ||
return Err(Error::new( | ||
ErrorKind::FeatureUnsupported, | ||
"Partition columns cannot be specified", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of returning different errors for each check, I think a more appropriate message would be like
Currently we only support reading existing icebergs tables in external table command. To create new table, please use catalog provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I roughly understand your intention. First, inform the user that external tables currently only support read operations, but this doesn't mean write operations won't be supported in the future. Secondly, provide a method for creating a new table.
// Field { name: "y", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "2"} }, | ||
// Field { name: "z", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "3"} } | ||
let actual_schema = table_provider.schema(); | ||
let expected_schema_str = "Field { name: \"x\", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {\"PARQUET:field_id\": \"1\"} }, Field { name: \"y\", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {\"PARQUET:field_id\": \"2\"} }, Field { name: \"z\", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {\"PARQUET:field_id\": \"3\"} }"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pure format string comparison doesn't sound like a good idea to me. It would be better to use schema builder api to create schema and compare them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
|
||
#[derive(Default)] | ||
#[non_exhaustive] | ||
pub struct IcebergTableProviderFactory {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to provide some doc to explain its usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I will add doc for it
fn complement_namespace_if_necessary(table_name: &TableReference) -> Cow<'_, TableReference> { | ||
match table_name { | ||
TableReference::Bare { table } => { | ||
Cow::Owned(TableReference::partial("default", table.as_ref())) | ||
} | ||
other => Cow::Borrowed(other), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pay attention to this function, because when DataFusion
creates external tables, it only creates bare TableReference
, which does not include a schema
. However, when constructing TableIdent
in Iceberg, the 'schema' is required. Does anyone have a good approach for handling compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into an issue with external tables only creating bare TableReference
as well - do you know if thats the intended behavior or if its a bug in datafusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthewmturner I believe this is a known issue, see https://github.com/apache/datafusion/blob/main/datafusion/sql/src/statement.rs#L1242
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anyone have a good approach for handling compatibility?
I'm not quite familiar with datafusion, but typicall it should be registered with current namespace? For example, a user has executed use ns1
before registering this table, then ideally this table should be register under ns1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liurenjie1024 You’re right. When creating an external table in DataFusion without specifying a schema, it will to being created under the default schema. In this case, the table name passed to TableProviderFactory
does not include schema information. However, TableIdent
requires a namespace, so I use "default" to populate this value. Anyway, the IcebergTableProvider
doesn't actually rely on the specific content of TableIdent
, so it has no impact when used as an external table in DataFusion (at least as things stand).
c350f7e
to
625cdbf
Compare
@liurenjie1024 Please help review the PR again when you get a chance |
0582384
to
eb466b5
Compare
remind @liurenjie1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yukkit for this pr, LGTM! Just some minor problems to fix, others are great!
/// An error will be returned if any unsupported feature, such as partition columns, | ||
/// order expressions, constraints, or column defaults, is detected in the table creation command. | ||
#[derive(Default)] | ||
#[non_exhaustive] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't quite understand why we need to add this annotation here?
fn complement_namespace_if_necessary(table_name: &TableReference) -> Cow<'_, TableReference> { | ||
match table_name { | ||
TableReference::Bare { table } => { | ||
Cow::Owned(TableReference::partial("default", table.as_ref())) | ||
} | ||
other => Cow::Borrowed(other), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anyone have a good approach for handling compatibility?
I'm not quite familiar with datafusion, but typicall it should be registered with current namespace? For example, a user has executed use ns1
before registering this table, then ideally this table should be register under ns1
?
Sorry for late reply, would you mind to fix comments and resolve conflicts? |
Thanks @yukkit for this pr! |
resolve #586
TODO: