
Conversation

@andrewxue-db
Contributor

What changes were proposed in this pull request?

SQLConf.caseSensitiveAnalysis is currently being retrieved for every column when resolving column names. This is expensive if there are many columns. We can instead retrieve it once before the loop, and reuse the result.
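The shape of the fix can be sketched as follows. This is an illustrative stand-in, not Spark's actual analyzer code: `caseSensitiveAnalysis`, `resolveSlow`, and `resolveFast` below are hypothetical names modeling the before/after behavior.

```scala
// Hedged sketch of the optimization: hoist the (relatively expensive)
// conf lookup out of the per-column loop instead of re-reading it once
// per column. All names here are illustrative, not Spark's real code.
object ResolveSketch {
  // Stand-in for SQLConf.caseSensitiveAnalysis, which walks a conf map.
  def caseSensitiveAnalysis: Boolean = false

  // Before: one conf read per column (O(n) conf lookups for n columns).
  def resolveSlow(columns: Seq[String], target: String): Option[String] =
    columns.find { c =>
      if (caseSensitiveAnalysis) c == target else c.equalsIgnoreCase(target)
    }

  // After: read the conf once before the loop and reuse the result.
  def resolveFast(columns: Seq[String], target: String): Option[String] = {
    val caseSensitive = caseSensitiveAnalysis
    columns.find { c =>
      if (caseSensitive) c == target else c.equalsIgnoreCase(target)
    }
  }
}
```

Both variants return the same result; only the number of conf reads differs, which is what dominates on wide (e.g. 10k-column) tables.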

Why are the changes needed?

Performance improvement.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Profiles of adding 1 column to an empty 10k-column table (hms-parquet):

Before (55s):
[profiler screenshot, 2024-06-10 12:17:49 PM]

After (13s):
[profiler screenshot, 2024-06-10 12:17:04 PM]

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jun 13, 2024
@andrewxue-db andrewxue-db force-pushed the andrewxue-db/spark-48622 branch from 6d5fb53 to 16a4bc5 on June 13, 2024 20:07
@yaooqinn yaooqinn closed this in be154a3 Jun 14, 2024
@yaooqinn
Member

Merged to master, thank you @andrewxue-db

gengliangwang pushed a commit that referenced this pull request Jun 19, 2024
### What changes were proposed in this pull request?
This PR migrates Scala logging to comply with the scala style changes in [#46979](#46947)

### Why are the changes needed?
This makes development and PR review of the structured logging migration easier.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by ensuring `dev/scalastyle` checks pass

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46980 from asl3/logging-migrationscala.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Jun 23, 2024
…gEntry)

### What changes were proposed in this pull request?

This PR proposes two micro-optimizations for `SparkConf.get(ConfigEntry)`:

1. Avoid costly `Regex.replaceAllIn` for variable substitution: if the config value does not contain the substring `${` then it cannot possibly contain any variables, so we can completely skip the regex evaluation in such cases.
2. Avoid Scala collections operations, including `List.flatten` and `AbstractIterable.mkString`, for the common case where a configuration does not define a prepended configuration key.
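Optimization 1 can be sketched as follows. The object, method, and regex below are illustrative stand-ins for the idea, not SparkConf's actual substitution code:

```scala
import scala.util.matching.Regex

// Hedged sketch of optimization 1: skip regex-based variable substitution
// entirely when the value cannot contain a "${var}" reference.
object SubstitutionSketch {
  // Illustrative pattern for "${name}" references; not SparkConf's real one.
  private val VarPattern: Regex = """\$\{([^}]+)\}""".r

  def substitute(value: String, vars: Map[String, String]): String = {
    // Fast path: no "${" substring means no variables can be present,
    // so we avoid the costly Regex.replaceAllIn call altogether.
    if (!value.contains("${")) return value
    VarPattern.replaceAllIn(value, m =>
      // quoteReplacement prevents "$" in the result from being
      // misread as a regex group reference.
      Regex.quoteReplacement(vars.getOrElse(m.group(1), m.matched)))
  }
}
```

Most config values never contain `${`, so the common case becomes a single substring scan instead of a regex pass.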

### Why are the changes needed?

Improve performance.

This is primarily motivated by unit testing and benchmarking scenarios but it will also slightly benefit production queries.

Spark tries to avoid excessive configuration reading in hot paths (e.g. via changes like #46979). If we do accidentally introduce such read paths, though, then this PR's optimizations will help to greatly reduce the associated performance penalty.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Correctness should be covered by existing unit tests.

To measure performance, I did some manual benchmarking by running

```scala
val conf = new SparkConf()
conf.set("spark.network.crypto.enabled", "true")
```

followed by

```scala
conf.get(NETWORK_CRYPTO_ENABLED)
```

10 million times in a loop.
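A minimal version of such a benchmark loop might look like the sketch below. It uses a plain `Map` as a stand-in for `SparkConf` (so it runs standalone); `ConfBenchSketch` and `bench` are hypothetical names, and real measurements would additionally need JIT warmup runs:

```scala
// Hedged sketch of the benchmark: read one config key many times in a
// tight loop and report elapsed time. A Map stands in for SparkConf.
object ConfBenchSketch {
  def bench(iterations: Int): (Long, Double) = {
    val conf = Map("spark.network.crypto.enabled" -> "true")
    var hits = 0L
    val start = System.nanoTime()
    var i = 0
    while (i < iterations) {
      // Consume the value so the JIT cannot eliminate the read.
      if (conf("spark.network.crypto.enabled") == "true") hits += 1
      i += 1
    }
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (hits, elapsedMs)
  }
}
```

Comparing the elapsed time of the old and new read paths over the same iteration count gives the throughput ratio reported below.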

On my laptop, the optimized code achieves ~7.5x higher throughput than the original.

We can also compare the before-and-after flamegraphs from a `while(true)` configuration reading loop, showing a clear difference in hotspots before and after this change:

**Before**:

![image](https://github.com/apache/spark/assets/50748/a60cec03-2400-46a5-95f5-f33b88a4872a)

**After**:

![image](https://github.com/apache/spark/assets/50748/10a45575-148b-4f5f-a431-b8036fe59866)

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Github Copilot

Closes #47049 from JoshRosen/SPARK-48678-sparkconf-perf-optimizations.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
