Skip to content

Conversation

cemakd
Copy link
Contributor

@cemakd cemakd commented Sep 23, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:
The GKE Data Cache was exhibiting extreme tail latency on partially
cached volumes, performing significantly worse than non-cached volumes.
Initial investigation into LVM cache policies and atime-induced
write-back thrashing did not reveal the root cause.

A series of fio benchmarks comparing cached and uncached volumes
uncovered a bug where the LVM cache was being configured with a
massive 1 GB chunk size. This was traced to a failure in the
fetchPvSizeGiB function in cache.go. The pvs command syntax was
malformed, and a subsequent string parsing error on its empty output
caused the driver to fall back to the default maxChunkSize of 1 GB.

This 1 GB chunk size led to extreme "read and write amplification," where a
single 4KB cache miss would trigger an inefficient 1 GB read from the
backing disk, explaining the catastrophic per-miss penalty.

This change corrects the pvs command syntax and the string parsing
logic within the driver. With this fix, the cache is now correctly
configured with a reasonable chunk size (e.g., 384 KiB), which resolves
the extreme read and write amplification and restores performance to expected
levels.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix partial data cache tail latency by correcting the cache chunk size calc

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 23, 2025
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 23, 2025
…culation

Undo temporary chunk size changes

Fix temporary variable name change
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 23, 2025
Update existing unit test to have suffix GiB

Add comment to clarify cache size is always in GiB
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 23, 2025
@mattcary
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 23, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cemakd, mattcary

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 23, 2025
@sunnylovestiramisu
Copy link
Contributor

sunnylovestiramisu commented Sep 23, 2025

Waiting for e2e testing edge cases confirmation

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2025
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Sep 23, 2025

@cemakd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gcp-compute-persistent-disk-csi-driver-e2e-arm 648f174 link false /test pull-gcp-compute-persistent-disk-csi-driver-e2e-arm
pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019 648f174 link false /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@cemakd
Copy link
Contributor Author

cemakd commented Sep 24, 2025

/unhold e2e verification complete, the new chunk size is 2,368 KiB (or ~2.31 MiB) on a cluster with data-cache-count=6

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 24, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0061918 into kubernetes-sigs:master Sep 24, 2025
8 of 10 checks passed
@cemakd cemakd deleted the benchmark-fix2 branch September 24, 2025 18:06
@cemakd
Copy link
Contributor Author

cemakd commented Sep 24, 2025

/cherry-pick release-1.21

@k8s-infra-cherrypick-robot

@cemakd: new pull request created: #2177

In response to this:

/cherry-pick release-1.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants