Conversation

anchovYu (Contributor) commented Apr 10, 2024

What changes were proposed in this pull request?

This PR adds a debug log for the Dataframe cache, turned on via a SQL conf. It logs the necessary information on:

  • cache hits during cache application (which happens on basically every query)
  • cache misses
  • adding new cache entries
  • removing cache entries (including clearing all entries)

Because every query applies the cache, this log could be huge; it should only be turned on during debugging and should not be enabled by default in production.
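Conceptually, every log call goes through a small helper that dispatches on the configured level, so the output stays silent under default log settings. A minimal sketch of that gating in Scala (the helper name logCacheOperation appears in the diff below; the body, the fallback, and the default value are assumptions):

import java.util.Locale

import org.apache.spark.internal.Logging
import org.apache.spark.sql.internal.SQLConf

object CacheManager extends Logging {
  // Emit the message at the level configured by spark.sql.dataframeCache.logLevel.
  // The message is passed by name so it is only rendered if that level is enabled.
  def logCacheOperation(message: => String): Unit = {
    SQLConf.get.getConfString("spark.sql.dataframeCache.logLevel", "trace")
        .toUpperCase(Locale.ROOT) match {
      case "TRACE" => logTrace(message)
      case "DEBUG" => logDebug(message)
      case "INFO"  => logInfo(message)
      case "WARN"  => logWarning(message)
      case "ERROR" => logError(message)
      case _       => logTrace(message) // unknown value: fall back to the quietest level
    }
  }
}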

Example:

spark.conf.set("spark.sql.dataframeCache.logLevel", "warn")
val df = spark.range(1, 10)


df.collect()
{"ts":"2024-04-10T16:41:10.010-0700","level":"WARN","msg":"Dataframe cache miss for input plan:\nRange (1, 10, step=1, splits=Some(10))\n","logger":"org.apache.spark.sql.execution.CacheManager"}
{"ts":"2024-04-10T16:41:10.010-0700","level":"WARN","msg":"Last 20 Dataframe cache entry logical plans:\n[]","logger":"org.apache.spark.sql.execution.CacheManager"}

df.cache()
{"ts":"2024-04-10T16:42:18.647-0700","level":"WARN","msg":"Dataframe cache miss for input plan:\nRange (1, 10, step=1, splits=Some(10))\n","logger":"org.apache.spark.sql.execution.CacheManager"}
{"ts":"2024-04-10T16:42:18.647-0700","level":"WARN","msg":"Last 20 Dataframe cache entry logical plans:\n[]","logger":"org.apache.spark.sql.execution.CacheManager"}
{"ts":"2024-04-10T16:42:18.662-0700","level":"WARN","msg":"Added Dataframe cache entry:\nCachedData(\nlogicalPlan=Range (1, 10, step=1, splits=Some(10))\n\nInMemoryRelation=InMemoryRelation [id#2L], StorageLevel(disk, memory, deserialized, 1 replicas)\n   +- *(1) Range (1, 10, step=1, splits=10)\n)\n","logger":"org.apache.spark.sql.execution.CacheManager"}


df.count()
{"ts":"2024-04-10T16:43:36.033-0700","level":"WARN","msg":"Dataframe cache hit for input plan:\nRange (1, 10, step=1, splits=Some(10))\nmatched with cache entry:\nCachedData(\nlogicalPlan=Range (1, 10, step=1, splits=Some(10))\n\nInMemoryRelation=InMemoryRelation [id#2L], StorageLevel(disk, memory, deserialized, 1 replicas)\n   +- *(1) Range (1, 10, step=1, splits=10)\n)\n","logger":"org.apache.spark.sql.execution.CacheManager"}
{"ts":"2024-04-10T16:43:36.041-0700","level":"WARN","msg":"Dataframe cache hit plan change summary:\n Aggregate [count(1) AS count#13L]           Aggregate [count(1) AS count#13L]\n!+- Range (1, 10, step=1, splits=Some(10))   +- InMemoryRelation [id#2L], StorageLevel(disk, memory, deserialized, 1 replicas)\n!                                                  +- *(1) Range (1, 10, step=1, splits=10)","logger":"org.apache.spark.sql.execution.CacheManager"}


df.unpersist()
{"ts":"2024-04-10T16:44:15.965-0700","level":"WARN","msg":"Removed 1 Dataframe cache entries, with logical plans being \n[Range (1, 10, step=1, splits=Some(10))\n]","logger":"org.apache.spark.sql.execution.CacheManager"}

Why are the changes needed?

Easier debugging.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran a local Spark shell.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Apr 10, 2024
cachedData = cachedData.filterNot(cd => plansToUncache.exists(_ eq cd))
}
plansToUncache.foreach { _.cachedRepresentation.cacheBuilder.clearCache(blocking) }
CacheManager.logCacheOperation(s"Removed ${plansToUncache.size} Dataframe cache " +
Member
I think we need to migrate to the structured logging API here.
Framework: #45729
Example migration PR: #45834
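For illustration, one migrated call site might look like the sketch below (written against the framework from #45729; the LogKeys.SIZE key and the surrounding class shape are assumptions, not the merged code):

import org.apache.spark.internal.{Logging, LogKeys, MDC}

class CacheManager extends Logging {
  private def logRemoval(numRemoved: Int): Unit = {
    // MDC records the value under a predefined structured-logging key, so the
    // count becomes a queryable field in the JSON log output instead of only
    // being interpolated into the message string.
    logWarning(log"Removed ${MDC(LogKeys.SIZE, numRemoved)} Dataframe cache entries")
  }
}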

.checkValues(StorageLevelMapper.values.map(_.name()).toSet)
.createWithDefault(StorageLevelMapper.MEMORY_AND_DISK.name())

val DATAFRAME_CACHE_LOG_LEVEL = buildConf("spark.sql.dataframeCache.logLevel")
Member
Will we need to debug cache table as well? Shall we rename the config to spark.sql.cache.logLevel?

Member
Also, let's make it an internal conf since it is for developers.

anchovYu (Contributor, Author)
I kept the Dataframe cache naming to differentiate it from the RDD cache.

Member
RDD is a Spark core concept. Anyway I respect your choice here.
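Folding in the thread's feedback, the conf definition might end up along these lines (a sketch only; the doc text, version, accepted values, and default are assumptions rather than the merged code):

val DATAFRAME_CACHE_LOG_LEVEL = buildConf("spark.sql.dataframeCache.logLevel")
  .internal()  // developer-facing, per the review feedback above
  .doc("Log level for Dataframe cache operations: adding and removing entries, " +
    "and hits/misses during cache application.")
  .version("4.0.0")
  .stringConf
  .transform(_.toUpperCase(java.util.Locale.ROOT))
  .checkValues(Set("TRACE", "DEBUG", "INFO", "WARN", "ERROR"))
  .createWithDefault("TRACE")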

anchovYu requested a review from gengliangwang April 15, 2024 22:10
gengliangwang (Member)

Thanks, merging to master
