
feat: Add specific configs for converting Spark Parquet and JSON data to Arrow #832

Merged
merged 16 commits into apache:main from add-json-scan-config
Aug 16, 2024

Conversation

andygrove
Member

Which issue does this PR close?

N/A

Rationale for this change

We already supported reading JSON, but the user needed to know to add FileSourceScan and/or BatchScan to the existing spark.comet.sparkToColumnar.supportedOperatorList configuration setting, and this was not documented. We also did not have any tests for this use case.

This PR adds a specific spark.comet.scan.json.enabled config.

What changes are included in this PR?

  • Add a specific config for enabling JSON scans (disabled by default)
  • Add one test
  • Add a note to the user guide that we have experimental support for JSON scan

How are these changes tested?

@andygrove andygrove changed the title feat: Add specific configuration for enabling JSON scan support feat: Add specific configs for converting Spark Parquet and JSON data to Arrow Aug 15, 2024
@andygrove
Member Author

@Kimahriman @eejbyfeldt Could you review please? This extends your earlier work to make it more accessible to users and allows us to support native execution on JSON data sources. I'll create a separate PR for CSV support.

df = spark.read.parquet(dir.toString())
checkSparkAnswerAndOperator(df.select("nested1.id"))
checkSparkAnswerAndOperator(df.select("nested1.nested2.id"))
Seq("", "parquet").foreach { v1List =>
Member Author

I added this line so that we test with both v1 and v2 sources

@Kimahriman
Contributor

Hmmm, do you need separate configs per type? Or should there just be a single dedicated "convert from non-supported scan" config that covers all input formats?

@andygrove
Member Author

Hmmm, do you need separate configs per type? Or should there just be a single dedicated "convert from non-supported scan" config that covers all input formats?

That would probably be OK, but there is the potential case where users have multiple file types and there is a bug or performance issue with one specific type, so it would be beneficial to be able to disable just that type. Also, at some point we will likely want to enable individual formats by default once they are well tested and performant.

@Kimahriman
Contributor

The other alternative in that case would be something similar to useV1SourceList: have a single config with a list of formats to apply this to. I don't have a strong opinion either way, though.
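Something like this, as a rough sketch only (the key name is invented for illustration, and I'm assuming the CometConf builder exposes a stringConf helper like the booleanConf one used in this diff):

  // Illustrative only, not part of this PR: a single list-valued config naming
  // the formats whose Spark scans should be converted to Arrow, similar in
  // spirit to Spark's spark.sql.sources.useV1SourceList.
  val COMET_CONVERT_SOURCE_LIST: ConfigEntry[String] =
    conf("spark.comet.convert.sourceList") // hypothetical key
      .doc("Comma-separated list of file formats (e.g. 'parquet,json') whose Spark scans " +
        "should be converted to Arrow format when Comet cannot scan them natively.")
      .stringConf
      .createWithDefault("")

  // Usage sketch: split the list and check membership per format.
  // val formats = COMET_CONVERT_SOURCE_LIST.get(conf).split(",").map(_.trim.toLowerCase).toSet
  // val convertParquet = formats.contains("parquet")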

Comment on lines 96 to 103
val COMET_CONVERT_FROM_PARQUET_ENABLED: ConfigEntry[Boolean] =
  conf("spark.comet.convert.parquet.enabled")
    .doc(
      "When enabled, data from Parquet v1 and v2 scans will be converted to Arrow format. Note " +
        "that to enable native vectorized execution, both this config and " +
        "'spark.comet.exec.enabled' need to be enabled.")
    .booleanConf
    .createWithDefault(false)
Member
This might be confused with COMET_NATIVE_SCAN_ENABLED. We can probably mention in the doc that this does not use the Comet native scan, but instead uses the Spark Parquet reader and converts the output to Comet's format.

Member Author

I added this in the docs:

## Parquet

 When `spark.comet.scan.enabled` is enabled, Parquet scans will be performed natively by Comet if all data types
 in the schema are supported. When this option is not enabled, the scan will fall back to Spark. In this case,
 enabling `spark.comet.convert.parquet.enabled` will immediately convert the data into Arrow format, allowing native 
 execution to happen after that, but the process may not be efficient.

I'll take another pass at the config description, though, to make it more detailed.
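As a rough usage sketch of the combination described in those docs (shown with spark.conf.set for brevity; illustrative, not exact):

  // Force the fallback path: disable Comet's native scan, keep native exec on,
  // and convert the Spark Parquet scan output to Arrow.
  spark.conf.set("spark.comet.scan.enabled", "false")
  spark.conf.set("spark.comet.exec.enabled", "true")
  spark.conf.set("spark.comet.convert.parquet.enabled", "true")

  // The scan falls back to Spark, its batches are converted to Arrow, and the
  // rest of the query can then run natively.
  val df = spark.read.parquet("/path/to/data.parquet") // placeholder path
  df.selectExpr("count(*)").show()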

// v1 scan
case scan: FileSourceScanExec =>
  scan.relation.fileFormat match {
    case _: JsonFileFormat => CometConf.COMET_CONVERT_FROM_JSON_ENABLED.get(conf)
Member

@parthchandra This is what we discussed.

Comment on lines +462 to 463
val COMET_SPARK_TO_ARROW_ENABLED: ConfigEntry[Boolean] =
conf("spark.comet.sparkToColumnar.enabled")
Contributor

Do you prefer that name, or should we update the config to sparkToArrow while we're making these updates?

Member Author

I don't have a strong feeling about whether we should change the config key. I think the description sufficiently explains what this does, and I didn't want to cause extra work for people who are already using these configs.

If others think we should change it then I am fine with that too.

andygrove and others added 3 commits August 15, 2024 11:13
…ons.scala

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
…ons.scala

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
@andygrove andygrove merged commit 6051232 into apache:main Aug 16, 2024
74 checks passed
@andygrove andygrove deleted the add-json-scan-config branch August 16, 2024 12:46
Contributor

@eejbyfeldt eejbyfeldt left a comment

Sorry about the slow review; I did not get time until today. I only have one minor comment, about how the new configuration might break the existing way of enabling this behavior.

Comment on lines +1146 to +1151
case scan: BatchScanExec =>
  scan.scan match {
    case _: JsonScan => CometConf.COMET_CONVERT_FROM_JSON_ENABLED.get(conf)
    case _: ParquetScan => CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.get(conf)
    case _ => isSparkToArrowEnabled(conf, op)
  }
Contributor

Is it intended that the new options take precedence over the old operator list in COMET_SPARK_TO_ARROW_SUPPORTED_OPERATOR_LIST? If someone had already enabled conversion for all data source scans using that list, this change will disable it for Parquet and JSON.

Maybe it would be less surprising if the logic was

  private def shouldApplySparkToColumnar(conf: SQLConf, op: SparkPlan): Boolean = {
    // Only consider converting leaf nodes to columnar currently, so that all the following
    // operators can have a chance to be converted to columnar. Leaf operators that output
    // columnar batches, such as Spark's vectorized readers, will also be converted to native
    // comet batches.
    CometSparkToColumnarExec.isSchemaSupported(op.schema) && (
      isSparkToArrowEnabledOp(conf, op) || isSparkToArrowEnabledDataSource(conf, op))
  }

  private def isSparkToArrowEnabledDataSource(conf: SQLConf, op: SparkPlan): Boolean = {
    op match {
      // Convert Spark DS v1 scan to Arrow format
      case scan: FileSourceScanExec =>
        scan.relation.fileFormat match {
          case _: JsonFileFormat => CometConf.COMET_CONVERT_FROM_JSON_ENABLED.get(conf)
          case _: ParquetFileFormat => CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.get(conf)
          case _ => false
        }
      // Convert Spark DS v2 scan to Arrow format
      case scan: BatchScanExec =>
        scan.scan match {
          case _: JsonScan => CometConf.COMET_CONVERT_FROM_JSON_ENABLED.get(conf)
          case _: ParquetScan => CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED.get(conf)
          case _ => false
        }
      // Any other leaf operator: no per-format conversion config applies
      case _ => false
    }
  }

  private def isSparkToArrowEnabledOp(conf: SQLConf, op: SparkPlan) = {
    op.isInstanceOf[LeafExecNode] && COMET_SPARK_TO_ARROW_ENABLED.get(conf) && {
      val simpleClassName = Utils.getSimpleName(op.getClass)
      val nodeName = simpleClassName.replaceAll("Exec$", "")
      COMET_SPARK_TO_ARROW_SUPPORTED_OPERATOR_LIST.get(conf).contains(nodeName)
    }
  }

So that we do not change the behavior of existing configs.
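To make that concrete, a rough sketch of the scenario I have in mind (reusing the helper and test data names from the test snippet quoted earlier in this PR, so treat it as illustrative):

  // Hypothetical: a user who previously enabled conversion for all scans via
  // the operator list alone, without setting the new per-format configs.
  withSQLConf(
    "spark.comet.sparkToColumnar.enabled" -> "true",
    "spark.comet.sparkToColumnar.supportedOperatorList" -> "FileSourceScan,BatchScan") {
    // With the merged logic the new per-format configs (default false) take
    // precedence for Parquet/JSON scans, so this scan is no longer converted;
    // with the variant above it still would be.
    val df = spark.read.parquet(dir.toString())
    checkSparkAnswerAndOperator(df.select("nested1.id"))
  }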

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
… to Arrow (apache#832)

* Add option to enable JSON scan

* support v1 and v2 data sources

* formatting

* format

* Add schema check

* fix regression

* improve configs and docs

* fix

* improve test and fix typo in doc

* renaming some variables but no change to public config names

* improve test

* improve test

* Update spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

* Update spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

* improve config description

* fix path

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>