Fix partial cache tail latency by correcting the cache chunk size calc #2176
Conversation
Force-pushed from 392a81a to f54d16b (commits: …culation; Undo temporary chunk size changes; Fix temporary variable name change)
Force-pushed from 38b8836 to 648f174 (commits: Update existing unit test to have suffix GiB; Add comment to clarify cache size is always in GiB)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cemakd, mattcary.
Waiting for e2e testing edge cases confirmation. /hold
@cemakd: The following tests failed.
/unhold
e2e verification complete; the new chunk size is 2,368 KiB (~2.31 MiB) on a cluster with data-cache-count=6.
/cherry-pick release-1.21
@cemakd: new pull request created: #2177
What type of PR is this?
/kind bug
What this PR does / why we need it:
The GKE Data Cache was exhibiting extreme tail latency on partially cached volumes, performing significantly worse than non-cached volumes. Initial investigation into LVM cache policies and `atime`-induced write-back thrashing did not reveal the root cause.

A series of `fio` benchmarks comparing cached and uncached volumes uncovered a bug where the LVM cache was being configured with a massive 1 GB chunk size. This was traced to a failure in the `fetchPvSizeGiB` function in `cache.go`. The `pvs` command syntax was malformed, and a subsequent string parsing error on its empty output caused the driver to fall back to the default `maxChunkSize` of 1 GB.

This 1 GB chunk size led to extreme "read and write amplification", where a single 4 KB cache miss would trigger an inefficient 1 GB read from the backing disk, explaining the catastrophic per-miss penalty.
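To make the failure mode concrete, here is a minimal Go sketch of that fallback path. It is illustrative only, not the actual `cache.go` code: the helper `parsePvSizeGiB`, the constant `maxChunkSizeKiB`, and the 384 KiB placeholder are assumptions for demonstration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Hypothetical ceiling mirroring the maxChunkSize fallback described above.
const maxChunkSizeKiB = 1024 * 1024 // 1 GiB expressed in KiB

// parsePvSizeGiB parses the single size value pvs is expected to print.
// With the malformed command, the output is empty and parsing fails.
func parsePvSizeGiB(pvsOutput string) (float64, error) {
	trimmed := strings.TrimSuffix(strings.TrimSpace(pvsOutput), "g")
	if trimmed == "" {
		return 0, fmt.Errorf("empty pvs output")
	}
	return strconv.ParseFloat(trimmed, 64)
}

func main() {
	pvSizeGiB, err := parsePvSizeGiB("") // malformed pvs -> empty output
	chunkKiB := maxChunkSizeKiB          // silent fallback on any error
	if err == nil && pvSizeGiB > 0 {
		chunkKiB = 384 // placeholder for the real size-based calculation
	}
	fmt.Printf("pvSizeGiB=%v err=%v chunkSizeKiB=%d\n", pvSizeGiB, err, chunkKiB)
}
```

Every 4 KiB miss then has to stage an entire 1 GiB chunk from the backing disk, roughly a 262,144× amplification, which matches the tail-latency behaviour described above.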
This change corrects the `pvs` command syntax and the string parsing logic within the driver. With this fix, the cache is now correctly configured with a reasonable chunk size (e.g., 384 KiB), which resolves the extreme read and write amplification and restores performance to expected levels.
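For reference, a sketch of what the corrected path can look like, under stated assumptions: the device path, helper names, the 32 KiB alignment step, the clamping bounds, and the 1,000,000-chunk divisor are illustrative choices rather than the driver's exact values; the `pvs` flags (`-o pv_size`, `--units g`, `--noheadings`, `--nosuffix`) are standard LVM2 options.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// fetchPvSizeGiB asks pvs for a single, machine-parseable field so the
// output is just a number (e.g. "375.00") rather than a formatted table.
func fetchPvSizeGiB(device string) (float64, error) {
	out, err := exec.Command("pvs", device, "-o", "pv_size",
		"--units", "g", "--noheadings", "--nosuffix").Output()
	if err != nil {
		return 0, fmt.Errorf("pvs failed for %s: %w", device, err)
	}
	return strconv.ParseFloat(strings.TrimSpace(string(out)), 64)
}

// chunkSizeKiB derives an LVM cache chunk size from the cache size:
// divide by a maximum chunk count, round up to a 32 KiB step, and clamp
// to [32 KiB, 1 GiB]. Divisor and bounds are illustrative assumptions.
func chunkSizeKiB(cacheSizeGiB float64, maxChunks int64) int64 {
	const stepKiB, minKiB, maxKiB int64 = 32, 32, 1024 * 1024
	kib := int64(cacheSizeGiB*1024*1024) / maxChunks
	kib = (kib + stepKiB - 1) / stepKiB * stepKiB // round up to 32 KiB
	if kib < minKiB {
		return minKiB
	}
	if kib > maxKiB {
		return maxKiB
	}
	return kib
}

func main() {
	sizeGiB, err := fetchPvSizeGiB("/dev/nvme0n1") // hypothetical device path
	if err != nil {
		fmt.Println("error:", err) // surface the error instead of silently falling back
		return
	}
	fmt.Printf("PV size: %.2f GiB, chunk size: %d KiB\n",
		sizeGiB, chunkSizeKiB(sizeGiB, 1_000_000))
}
```

Under these illustrative constants, a cache in the few-hundred-GiB range yields a chunk size in the hundreds of KiB, consistent with the 384 KiB example above, rather than hitting the 1 GiB ceiling.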
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: