
Implement NodeGetVolumeStats #238

Merged (1 commit) on Oct 7, 2020

Conversation

@kbasv (Author) commented on Aug 10, 2020

Is this a bug fix or adding a new feature?
New feature.

What is this PR about? / Why do we need it?

  • Implements NodeGetVolumeStats using du under the hood and enables the GET_VOLUME_STATS node capability.
  • Adds a cache (map) to store the results of du, since du can take longer than the default kubelet timeout for the NodeGetVolumeStats RPC.
  • Provides a refreshRate option to control the rate at which du is invoked (a rough sketch of this flow follows the description).

What testing is done?
Added unit tests to confirm the behavior.
Tested the implementation end-to-end with EKS on EC2.
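For readers unfamiliar with the mechanics, here is a minimal, self-contained sketch of the "du under the hood, cached and refreshed on a period" idea. Names such as usageCache, refreshPeriod, and volumeUsedBytes are invented for illustration; the actual driver (see the review threads below) runs the walk in a background goroutine and uses a job tracker, so this is not the merged code.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"time"
)

// usageEntry is a cached result for one volume, refreshed no more often than refreshPeriod.
type usageEntry struct {
	usedBytes int64
	refreshed time.Time
}

var (
	usageCache    = map[string]usageEntry{}
	refreshPeriod = 5 * time.Minute // corresponds to the refresh flag discussed below
)

// diskUsage is the "du under the hood" part: sum file sizes under path.
func diskUsage(path string) (int64, error) {
	var total int64
	err := filepath.WalkDir(path, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip entries we cannot stat, much like du -s keeps going
		}
		if !d.IsDir() {
			if info, infoErr := d.Info(); infoErr == nil {
				total += info.Size()
			}
		}
		return nil
	})
	return total, err
}

// volumeUsedBytes serves from the cache and only re-walks after refreshPeriod.
// The real driver does the walk asynchronously so the RPC returns within
// kubelet's timeout; this sketch keeps it synchronous for brevity.
func volumeUsedBytes(volID, volPath string) (int64, error) {
	if e, ok := usageCache[volID]; ok && time.Since(e.refreshed) < refreshPeriod {
		return e.usedBytes, nil
	}
	used, err := diskUsage(volPath)
	if err != nil {
		return 0, err
	}
	usageCache[volID] = usageEntry{usedBytes: used, refreshed: time.Now()}
	return used, nil
}

func main() {
	used, err := volumeUsedBytes("fs-abcd:/", "/tmp")
	fmt.Println(used, err)
}
```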

@k8s-ci-robot (Contributor)
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) on Aug 10, 2020
@k8s-ci-robot (Contributor)
Welcome @kbasv!

It looks like this is your first PR to kubernetes-sigs/aws-efs-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-efs-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)
Hi @kbasv. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) and the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Aug 10, 2020
@kbasv (Author) commented on Aug 11, 2020

I signed it

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) and removed the cncf-cla: no label on Aug 11, 2020
@wongma7 (Contributor) commented on Aug 11, 2020

/ok-to-test

@k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member as safe to test) and removed the needs-ok-to-test label on Aug 11, 2020
@wongma7 (Contributor) left a comment

Approach is sound; some minor comments and questions we can discuss.

cmd/main.go Outdated
@@ -32,6 +32,8 @@ func main() {
version = flag.Bool("version", false, "Print the version and exit")
efsUtilsCfgDirPath = flag.String("efs-utils-config-dir-path", "/etc/amazon/efs/", "The path to efs-utils config directory")
efsUtilsStaticFilesPath = flag.String("efs-utils-static-files-path", "/etc/amazon/efs-static-files/", "The path to efs-utils static files directory")
volMetricsOptIn = flag.Bool("volMetricsOptIn", false, "Opt in to emit volume metrics")
Contributor:
This should be dash-separated instead of camelCase, to be consistent with the other args.

Author:
Sure, I'll update this to match the other args.
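For illustration only, the dash-separated spelling being requested would look something like this; the final flag names in the merged PR may differ.

```go
package main

import "flag"

var (
	// Hypothetical dash-separated names matching the style of the existing flags.
	volMetricsOptIn       = flag.Bool("vol-metrics-opt-in", false, "Opt in to emit volume metrics")
	volMetricsRefreshRate = flag.Float64("vol-metrics-refresh-rate", 5, "Refresh rate for volume metrics in minutes")
)

func main() {
	flag.Parse()
	_ = *volMetricsOptIn
	_ = *volMetricsRefreshRate
}
```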

cmd/main.go Outdated
@@ -45,7 +47,8 @@ func main() {
os.Exit(0)
}

drv := driver.NewDriver(*endpoint, *efsUtilsCfgDirPath, *efsUtilsStaticFilesPath)
drv := driver.NewDriver(*endpoint, *efsUtilsCfgDirPath, *efsUtilsStaticFilesPath, *volMetricsOptIn, *volMetricsRefreshRate)
drv.SetNodeCapOptInFeatures(*volMetricsOptIn)
Contributor:
Why not hide this in NewDriver, since NewDriver takes in volMetricsOptIn anyway? Not sure of the benefit of a separate function.

@@ -187,6 +186,11 @@ func (d *Driver) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublish
return nil, status.Error(codes.InvalidArgument, "Target path not provided")
}

if d.volMetricsOptIn {
klog.V(4).Infof("Evicting vol ID: %v, vol path : %v from cache", req.VolumeId, target)
Contributor:
It's possible to publish the same VolumeId to different targets.

I don't want to complicate this too much, but do we need some kind of ref counter to account for that, so we only evict from the cache if the counter == 0?

Author:
I think a counter will be a bit complex here. Do you know if a volume ID and volume path combination is unique? If so, I can use it as the cache key.

Contributor:
It will be unique, but if we do that then we will recalculate the same volume multiple times. For example, if I have a volume fs-abcd:/root mounted to both /pod1/target and /pod2/target, the routine will calculate the disk usage of /root in fs-abcd twice, once for /pod1/target and again for /pod2/target, which is unnecessary.

Author:
I now see what you mean with the counter. Sure, we can add a counter and evict only once every target for the volumeId has been unpublished.
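A rough sketch of what the ref counting agreed on here could look like; all names below (volCache, publishCount, onPublish, onUnpublish) are hypothetical and not taken from the PR.

```go
package main

import "sync"

// Hypothetical sketch: only evict a volume's cached metrics once every target
// it was published to has been unpublished.
type volCache struct {
	mu           sync.Mutex
	publishCount map[string]int   // volume ID -> number of published targets
	usedBytes    map[string]int64 // volume ID -> cached usage
}

func newVolCache() *volCache {
	return &volCache{
		publishCount: map[string]int{},
		usedBytes:    map[string]int64{},
	}
}

func (c *volCache) onPublish(volID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.publishCount[volID]++
}

func (c *volCache) onUnpublish(volID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.publishCount[volID] > 0 {
		c.publishCount[volID]--
	}
	if c.publishCount[volID] == 0 {
		delete(c.publishCount, volID)
		delete(c.usedBytes, volID) // evict only when the last target is gone
	}
}

func main() {
	c := newVolCache()
	c.onPublish("fs-abcd:/root")   // mounted at /pod1/target
	c.onPublish("fs-abcd:/root")   // mounted at /pod2/target
	c.onUnpublish("fs-abcd:/root") // one target unmounted: keep the cache entry
	c.onUnpublish("fs-abcd:/root") // last target unmounted: evict
}
```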

}

volUsed, ok := used.AsInt64()

Contributor:
Nitpick: would like to see the whitespace/newlines tightened up, like here. There doesn't need to be a newline between the call and the error checking. gofmt may help: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_aws-efs-csi-driver/238/pull-aws-efs-csi-driver-verify/1293265088089690114#1:build-log.txt%3A278

Author:
Sure, will update this.

func (v VolStatterImpl) computeDiskUsage(volId, volPath string) {
klog.V(5).Infof("Compute Volume Metrics invoked for Vol ID: %v", volId)

used, err := fs.DiskUsage(volPath)
Contributor:
What happens if Unpublish is called on this volume in the middle of this routine? Would DiskUsage block the Unpublish unmount for the duration it's reading the volume?

@kbasv (Author), Aug 12, 2020:
The logs from my testing indicate Unpublish will not succeed while DiskUsage is running; I see "device is busy" associated with the Unpublish command in the logs. Do you think it's okay to wait it out, or should we find a way to force unpublish in the above scenario?

Contributor:
Darn. Well, I think it's okay to wait it out. We can leave it as a future improvement, because we're relying on a library (fs.DiskUsage) and there's no way to cancel it.

In the worst case, Unpublish could be delayed indefinitely if the volume metrics period < the computeDiskUsage execution time and computeDiskUsage ends up running 100% of the time.

In a less-bad case, Unpublish could be delayed by the time it takes for DiskUsage to complete + 2 minutes. The 2 minutes comes from the exponential backoff for Unpublish retries: https://github.com/kubernetes/kubernetes/blob/323f34858de18b862d43c40b2cced65ad8e24052/pkg/util/goroutinemap/exponentialbackoff/exponential_backoff.go#L33

Author:
I'll add a TODO to figure out a way to cancel the running du process in the future.

@wongma7 self-assigned this on Aug 11, 2020
cmd/main.go Outdated
@@ -32,6 +32,8 @@ func main() {
version = flag.Bool("version", false, "Print the version and exit")
efsUtilsCfgDirPath = flag.String("efs-utils-config-dir-path", "/etc/amazon/efs/", "The path to efs-utils config directory")
efsUtilsStaticFilesPath = flag.String("efs-utils-static-files-path", "/etc/amazon/efs-static-files/", "The path to efs-utils static files directory")
volMetricsOptIn = flag.Bool("volMetricsOptIn", false, "Opt in to emit volume metrics")
volMetricsRefreshRate = flag.Float64("volMetricsRefreshRate", 5, "Refresh rate for volume metrics in minutes")
Contributor:
I think this should be renamed to "period" instead of "rate".

@kbasv force-pushed the volStats branch 2 times, most recently from 7fba2b8 to 0df1a23, on August 12, 2020 at 21:33
@kbasv (Author) commented on Aug 13, 2020

/test pull-aws-efs-csi-driver-e2e

@kbasv (Author) commented on Aug 13, 2020

/test pull-aws-efs-csi-driver-e2e

@pdrakeweb

Generally, the code here makes sense to me; however, the implementation seems a bit problematic. If I understand correctly, the underlying tool (du) essentially stats every file to count the size. For EFS file systems with many files, this may dramatically consume the available IOPS. As many volumes may be created on a single underlying EFS, many executions of du may then occur in parallel, further exacerbating the performance impact. An ideal implementation would instead call an AWS API to query the disk usage of a specific directory on an EFS file system. I recognize such an API does not exist and AWS would need to create it.

@wongma7 (Contributor) commented on Aug 13, 2020

You're right on all counts. This PR has an optimization to avoid reading the same volume handle (fs-abcd:/a/b) concurrently on the same node, but there's certainly room for more. If we think the performance penalty of enabling this feature will outweigh the benefit more often than not, we can do more optimizations first. Personally, I'm inclined to merge and collect feedback, but I don't want people to enable the feature and then be surprised when their credits get exhausted, so I guess we ought to test and evaluate the impact more first.

Some brainstorming:

  • We don't account for double reading of parent/child directories (fs-abcd:/a/b/c and fs-abcd:/a/b). Maybe we could read only leaf directories.
  • We don't account for the same file system being read from different nodes, which we obviously have to, since one of the main selling points of EFS is that you can do that. This is not easy: would each DaemonSet member hold a per-file-system lock (a Lease object) in Kubernetes, would they write a file in the EFS file system itself as a lock, etc.?
  • EFS metered size has a pretty generous eventual-consistency guarantee: "the value represents the actual size only if the file system is not modified for a period longer than a couple of hours." https://docs.aws.amazon.com/efs/latest/ug/API_FileSystemSize.html I imagine most users care only about metered size and want this feature to find culprits. So it follows that the driver only needs to be about as accurate as metered size to be useful, i.e. the default 5-minute refresh is probably excessive.

@kbasv (Author) commented on Aug 14, 2020

I agree with the concerns about the underlying implementation's potential to consume all the available IOPS of a file system.
From my observation while testing this implementation on a cluster, du stats ~32k files per minute with about 2 percent I/O consumption. I think the following two options would limit the volume statter's IOPS consumption:

  • Adding a jittered start to goroutine execution. This will minimize the number of volume statter routines executing simultaneously.
  • Introducing a rate limiter to limit the number of simultaneous executions per file system.

The above two options address the concern for mounts on a per-node basis (a rough sketch of both follows this comment). However, they will not address the concern of a file system mounted on multiple nodes. For a multi-node mount, given how DaemonSets work, it is not easy to share state between nodes, and any solution for multi-node mounts seems out of scope for the current implementation.
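Neither idea appears in this PR's diff; purely as an illustration, a jittered start plus a per-file-system concurrency limit might look like the sketch below. Every name here is hypothetical.

```go
package main

import (
	"math/rand"
	"sync"
	"time"
)

// Hypothetical: cap concurrent du walks per EFS file system with a semaphore,
// and add a random start delay so statter goroutines do not all fire at once.
var (
	limiterMu sync.Mutex
	fsLimiter = map[string]chan struct{}{} // file system ID -> semaphore
	maxPerFS  = 2
	maxJitter = 2 * time.Second // a real driver would likely use a larger window
)

func acquire(fsID string) chan struct{} {
	limiterMu.Lock()
	sem, ok := fsLimiter[fsID]
	if !ok {
		sem = make(chan struct{}, maxPerFS)
		fsLimiter[fsID] = sem
	}
	limiterMu.Unlock()
	sem <- struct{}{} // blocks while maxPerFS walks are already running for this fs
	return sem
}

func statVolume(fsID, volPath string, walk func(path string)) {
	time.Sleep(time.Duration(rand.Int63n(int64(maxJitter)))) // jittered start
	sem := acquire(fsID)
	defer func() { <-sem }()
	walk(volPath)
}

func main() {
	var wg sync.WaitGroup
	for _, p := range []string{"/a", "/b", "/c"} {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			statVolume("fs-abcd", path, func(string) { time.Sleep(100 * time.Millisecond) })
		}(p)
	}
	wg.Wait()
}
```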

@pdrakeweb

A start jitter and rate limiter could be helpful. The test uncovered roughly what I would expect - that du manages to consume a large portion of the available IOPS. Without QoS across clients, this may have a significant impact on applications using the volume. This is of particular concern for larger file systems (think about volumes exceeding 10M files). If an AWS provided API for this metadata is not possible, another option would be to introduce a mechanism for limiting the IOPS consumed by du. The kernel has mechanisms (cgroups) to limit block I/O but I presume these would not apply as the underlying EFS volume is not a block device. An alternative du which includes an internal IOP throttle could be used (although it would need to be written).

Comment on lines +149 to +155
volUsageCache[volId] = volMetrics
delete(volStatterJobTracker, volId)

Reviewer:
The map updates from this function are data races since the function is always called in a goroutine. The check in launchVolStatsRoutine to see if the volId is currently being processed is not sufficient - the check could be called at the same time as the delete here. I would suggest wrapping them in a sync mechanism. Removing the time.Sleep(waitTime) and stubbing out the external calls to the fs library will expose the data races in the test cases.

Author:
I think that should be okay since launchVolStatsRoutine will be invoked by kubelet approximately every minute.

Reviewer:
I'd say if it's invoked infrequently, then the overhead of a lock would be negligible. What's the recovery path if this panics?

Author:
I see what you mean. I didn't realize Go panics on concurrent map reads and writes. The alternative would be to use either locks or sync.Map. I guess locks are better given how kubelet works?

@kerbyhughes, Aug 19, 2020:
Correct, they are not safe for concurrent use and will panic under certain conditions. I'm not overly familiar with sync.Map, but I believe it was created as an optimization for certain conditions where regular locks are not meeting your performance needs.
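For concreteness, the lock-based fix being discussed amounts to something like the following sketch. Variable names mirror the diff excerpt above, but this is not the merged code and the driver's actual fields may differ.

```go
package main

import "sync"

// Sketch: guard both maps with one mutex so the statter goroutine's writes and
// the routine-launcher's reads cannot race.
var (
	statsMu              sync.Mutex
	volUsageCache        = map[string]int64{}
	volStatterJobTracker = map[string]bool{}
)

func storeResult(volID string, usedBytes int64) {
	statsMu.Lock()
	defer statsMu.Unlock()
	volUsageCache[volID] = usedBytes
	delete(volStatterJobTracker, volID)
}

func alreadyRunning(volID string) bool {
	statsMu.Lock()
	defer statsMu.Unlock()
	return volStatterJobTracker[volID]
}

func main() {
	if !alreadyRunning("fs-abcd:/") {
		storeResult("fs-abcd:/", 42)
	}
}
```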

@kerbyhughes

@kbasv I ran your updated code and the changes fixed the data races in volStatter. The race detector still shows a race where the test code itself does not wrap the map updates in a lock, but the rest looks good, thanks.

@kbasv (Author) commented on Aug 29, 2020

> @kbasv I ran your updated code and the changes fixed the data races in volStatter. The race detector still shows a race where the test code itself does not wrap the map updates in a lock, but the rest looks good, thanks.

@kerbyhughes Thanks for checking, fixed the test code with locks.

@kbasv (Author) commented on Aug 29, 2020

/retest

3 similar comments
@kbasv (Author) commented on Aug 31, 2020

/retest

@kbasv (Author) commented on Sep 10, 2020

/retest

@kbasv (Author) commented on Sep 11, 2020

/retest

}

func (v VolStatterImpl) computeVolumeMetrics(volId, volPath string, refreshRate float64, fsRateLimit int) (*volMetrics, error) {
if value, ok := v.retrieveFromCache(volId); ok {
Contributor:
What if I have two PVs mounted on this node with the same volume ID but different target paths? Say volume A is fs-123:/a and volume B is fs-123:/b. It will be random which metric I get back.

Contributor:
Maybe not random, but one of A or B will be reported for both A and B.

However, if you index the results cache by volume ID + path but the tracker just by volume ID, there might be situations where the metric for B is never computed. Say kubelet always calls get-stats on A before B; then B will never get computed, since the computation for A was just started and is in progress...

@kbasv (Author), Sep 30, 2020:
IIRC, the volumeId above consists of the file system ID + subpath. So fs-123:/a (volume A) and fs-123:/b (volume B) will be two different volume IDs, and a metric will be computed for both A and B.

The cache ensures that the metric for volume ID fs-123:/a, mounted separately as two volumes A and B, will not be computed twice.

Contributor:
OK, makes sense; I see now that you distinguish between the fsid for the per-file-system rate limit and the volId for the metrics cache.
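To make the distinction concrete, a hypothetical helper (not from the PR) that separates the two keys could look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical helper: split a volume handle like "fs-123:/a" into the file
// system ID (used for the per-file-system rate limit) and the subpath; the
// full handle itself serves as the metrics cache key.
func splitVolumeID(volID string) (fsID, subpath string) {
	if i := strings.Index(volID, ":"); i >= 0 {
		return volID[:i], volID[i+1:]
	}
	return volID, "/"
}

func main() {
	fsID, subpath := splitVolumeID("fs-123:/a")
	fmt.Println(fsID, subpath) // fs-123 /a
}
```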

@k8s-ci-robot added and then removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Oct 2, 2020
@kbasv (Author) commented on Oct 6, 2020

@wongma7 Let me know if you have more concerns on this PR. Otherwise, can you please merge this? Thanks!

@wongma7 (Contributor) commented on Oct 7, 2020

Please squash the commits and check coveralls. The bot will not merge if coveralls says coverage decreased.

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me"; indicates that a PR is ready to be merged) on Oct 7, 2020
@k8s-ci-robot (Contributor)
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: kbasv, wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Oct 7, 2020
* Enable GET_VOLUME_STATS node capability
@k8s-ci-robot removed the lgtm label on Oct 7, 2020
@wongma7 (Contributor) commented on Oct 7, 2020

/retest

},
}
)
makeDir(validPath)
Contributor:
The test environment doesn't like this; not sure if https://golang.org/pkg/io/ioutil/#TempDir will work either.
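For reference, using the linked ioutil.TempDir in the test might look roughly like this; the function and package names below are assumptions, not the PR's actual test code.

```go
package driver

import (
	"io/ioutil"
	"os"
	"testing"
)

// Hypothetical test setup using a temp directory instead of a fixed path;
// on newer Go versions, t.TempDir() would also handle the cleanup itself.
func TestVolumeStatsUsesTempDir(t *testing.T) {
	dir, err := ioutil.TempDir("", "efs-vol-stats")
	if err != nil {
		t.Fatalf("creating temp dir: %v", err)
	}
	defer os.RemoveAll(dir)

	// ... exercise the statter against dir instead of validPath ...
	_ = dir
}
```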

@kbasv (Author) commented on Oct 7, 2020

@wongma7 Looks like the coveralls test environment's temp directory creation is flaky, and that is pulling down the coverage. When running the test locally, temp directory creation works and the overall coverage is greater than 82%. The test environment succeeded in creating the temp directory once here: https://coveralls.io/builds/32722046

@wongma7 (Contributor) commented on Oct 7, 2020

@kbasv OK, thanks for looking into it. Since the test works locally and in pull-aws-efs-csi-driver-unit (the output actually looks crazy because I see efs_watch_dog.go:236] stopping... being spammed infinitely, but that's unrelated), and technically coveralls is not supposed to block merge, I am inclined to merge this after the rest of the tests pass and leave the coveralls fix as a follow-up.

/lgtm

@k8s-ci-robot added the lgtm label on Oct 7, 2020
@kbasv (Author) commented on Oct 7, 2020

/retest

@wongma7 (Contributor) commented on Oct 7, 2020

Manually merging due to coveralls being flaky. I'll open an issue to track it.

Labels
approved · cncf-cla: yes · lgtm · ok-to-test · size/XL
5 participants