Conversation

@KnightChess (Contributor) commented Jul 4, 2024

As #11552 describes, Hudi INSERT does not support a user-specified column list. This PR adds support for that case:

  • users can specify the column order in INSERT INTO / INSERT OVERWRITE (see the example below)
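
For illustration, a minimal usage sketch (the table and column names here are hypothetical, not taken from this PR):

// Hypothetical table; the column list in the INSERT may now differ from the declared order.
spark.sql("create table t1 (id int, name string, price double) using hudi tblproperties (primaryKey = 'id')")
spark.sql("insert into t1 (name, price, id) values ('a1', 10.0, 1)")
spark.sql("insert overwrite t1 (price, id, name) values (11.0, 1, 'a2')")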

Change Logs

Only Spark versions >= 3.2 are supported.

  • if the user specifies a column list, reorder the plan with a Project (see the sketch below)

Spark 3.4 is not supported when the default-value feature is enabled (apache/spark#36077):

  • first: in that Spark PR, Spark 3.4 adds default-value support for INSERT INTO and regenerates the user-specified columns itself, so there is no need to handle it on the Hudi side
  • second: that PR appends the Hudi meta fields with default values, which has a bug; it looks to be fixed in Spark 3.5 ([SPARK-43742][SQL] Refactor default column value resolution spark#41262), so users who want to specify columns on Spark 3.4 need to disable the default-value feature
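
A minimal sketch of the reordering idea (names are illustrative, not the exact code in this PR; it assumes for simplicity that every target column is specified and that names match case-sensitively):

import org.apache.spark.sql.catalyst.expressions.Alias
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

def reorderBySpecifiedCols(
    targetCols: Seq[String],          // target table schema order
    userSpecifiedCols: Seq[String],   // order written in the INSERT statement
    query: LogicalPlan): LogicalPlan = {
  // The i-th specified column name pairs with the i-th output attribute of the query.
  val byName = userSpecifiedCols.zip(query.output).toMap
  // Re-emit the query's output in table-schema order, aliased to the target column names.
  val projectList = targetCols.map { col =>
    val attr = byName.getOrElse(col, throw new IllegalArgumentException(s"Column '$col' is not specified"))
    Alias(attr, col)()
  }
  Project(projectList, query)
}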

Impact

none

Risk level (write none, low medium or high below)

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) on Jul 4, 2024
 * changes in Spark 3.3
 */
- def unapplyInsertIntoStatement(plan: LogicalPlan): Option[(LogicalPlan, Map[String, Option[String]], LogicalPlan, Boolean, Boolean)]
+ def unapplyInsertIntoStatement(plan: LogicalPlan): Option[(LogicalPlan, Seq[String], Map[String, Option[String]], LogicalPlan, Boolean, Boolean)]

Contributor: Can we add a doc for the new param?

Contributor Author: Done.
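
For context, the added doc presumably reads something like this (illustrative wording, not necessarily what was merged):

/**
 * Decomposes an INSERT INTO / INSERT OVERWRITE statement across Spark versions.
 *
 * @return (table, userSpecifiedCols, partitionSpec, query, overwrite, ifPartitionNotExists),
 *         where userSpecifiedCols is the column list written by the user,
 *         or Seq.empty if no column list was specified.
 */
def unapplyInsertIntoStatement(plan: LogicalPlan): Option[(LogicalPlan, Seq[String], Map[String, Option[String]], LogicalPlan, Boolean, Boolean)]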

case lr: LogicalRelation =>
  // Create a project if this is an INSERT INTO query with specified cols.
  val projectByUserSpecified = if (userSpecifiedCols.nonEmpty) {
    assert(lr.catalogTable.isDefined, "Missing catalog table")

Contributor: Use ValidationUtils.checkState instead.

Contributor Author: Done.
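
The fixed line presumably becomes something like the following (illustrative; ValidationUtils.checkState throws IllegalStateException with the given message when the condition is false):

import org.apache.hudi.common.util.ValidationUtils

ValidationUtils.checkState(lr.catalogTable.isDefined, "Missing catalog table")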

plan match {
  case InsertIntoTable(table, partition, query, overwrite, ifPartitionNotExists) =>
-   Some((table, partition, query, overwrite, ifPartitionNotExists))
+   Some((table, Seq.empty, partition, query, overwrite, ifPartitionNotExists))

Contributor: Do you think we should log some msg here?

Contributor Author: I think it's unnecessary, because it is not supported at the SQL grammar level.


- override def unapplyInsertIntoStatement(plan: LogicalPlan): Option[(LogicalPlan, Map[String, Option[String]], LogicalPlan, Boolean, Boolean)] = {
-   plan match {
-     case insert: InsertIntoStatement =>

Contributor: Why remove this impl?

Contributor Author: Every subclass has its own impl, so I removed it.

override def unapplyInsertIntoStatement(plan: LogicalPlan): Option[(LogicalPlan, Seq[String], Map[String, Option[String]], LogicalPlan, Boolean, Boolean)] = {
  plan match {
    case insert: InsertIntoStatement =>
      Some((insert.table, Seq.empty, insert.partitionSpec, insert.query, insert.overwrite, insert.ifPartitionNotExists))

Contributor: Should we log some msg here?

@hudi-bot (Collaborator) commented Jul 6, 2024

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405 (Contributor):

@leesf Do you have interest in reviewing this PR?

spark.sessionState.conf.unsetConf("hoodie.datasource.write.operation")
}

test("Test insert into with special cols") {

Contributor: special -> specified
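
For reference, the (renamed) test presumably follows this suite's usual pattern, roughly like the sketch below (illustrative only; it assumes the suite's generateTableName helper and checkAnswer assertion):

test("Test insert into with specified cols") {
  val tableName = generateTableName
  spark.sql(
    s"""create table $tableName (id int, name string, price double, ts long)
       |using hudi tblproperties (primaryKey = 'id', preCombineField = 'ts')""".stripMargin)
  // The column order intentionally differs from the table schema.
  spark.sql(s"insert into $tableName (name, ts, price, id) values ('a1', 1000, 10.0, 1)")
  checkAnswer(s"select id, name, price, ts from $tableName")(Seq(1, "a1", 10.0, 1000))
}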

}
}

test("Test insert overwrite with special cols") {

Contributor: ditto

@leesf (Contributor) commented Jul 7, 2024

LGTM, with minor comments.

@codope (Member) commented Jul 11, 2024

@KnightChess Looks like this PR does not handle the partition spec? Can you please check HUDI-7964?

codope self-assigned this on Jul 11, 2024

@KnightChess (Contributor Author):

@codope Hi, this logic is only triggered when the columns are specified, e.g. insert into aaa (id, day, price, name, hour) values (2, '01', 12.2, 'bbb', '02'), but your case does not specify columns. I also tested with this PR rolled back and the problem you point out still occurs, so it is not caused by this change.

@codope (Member) commented Jul 11, 2024

@KnightChess Yes, it is not due to this PR. I just tested by creating a parquet table and it's still the same behavior. So the issue is unrelated to Hudi. You could try as well:

spark-sql> drop table if exists test_table;
Time taken: 0.459 seconds
spark-sql>
         > create table test_table (
         >     ts BIGINT,
         >     id STRING,
         >     rider STRING,
         >     driver STRING,
         >     fare DOUBLE,
         >     city STRING,
         >     state STRING
         > )
         > USING parquet
         > PARTITIONED BY (state, city)
         > location 'file:///tmp/hudi_test_table';

spark-sql> INSERT INTO test_table VALUES (1695159649,'trip1','rider-A','driver-K',19.10,'san_francisco','california');

spark-sql> INSERT INTO test_table VALUES (1695091554,'trip2','rider-C','driver-M',27.70,'austin','texas');

This is what the directory structure looks like under the base path:
[screenshot: directory structure under the base path]

KnightChess deleted the insert-into-cols-specified branch on July 11, 2024 at 13:56
@KnightChess (Contributor Author):

@codope got it o( ̄▽ ̄)o

@KnightChess (Contributor Author):

@codope If the columns are specified, it looks like it works well.
[screenshot: result of an INSERT with a specified column list]
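
Presumably a statement along these lines against the table above (reconstructed for illustration, since the original evidence is a screenshot):

// Hypothetical, matching codope's table definition above:
spark.sql(
  """INSERT INTO test_table (ts, id, rider, driver, fare, city, state)
    |VALUES (1695159649, 'trip1', 'rider-A', 'driver-K', 19.10, 'san_francisco', 'california')""".stripMargin)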
