
Implement NodeGetVolumeStats #238

Merged (1 commit) on Oct 7, 2020

Conversation

@kbasv (Author) commented on Aug 10, 2020

Is this a bug fix or adding a new feature?
New feature.

What is this PR about? / Why do we need it?

  • Implements NodeGetVolumeStats using du under the hood and enables the GET_VOLUME_STATS node capability.
  • Adds a cache (map) to store the results of du, since du can take longer than the default kubelet timeout for the NodeGetVolumeStats RPC.
  • Provides a refreshRate option to control the rate at which du is invoked (a rough sketch of this flow follows the description).

What testing is done?
Added unit tests to confirm the behavior.
Tested the implementation end-to-end with EKS on EC2.
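For readers unfamiliar with the mechanics, here is a minimal, self-contained sketch of the "du under the hood, cached and refreshed on a period" idea. Names such as usageCache, refreshPeriod, and volumeUsedBytes are invented for illustration; the actual driver (see the review threads below) runs the walk in a background goroutine and uses a job tracker, so this is not the merged code.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"time"
)

// usageEntry is a cached result for one volume, refreshed no more often than refreshPeriod.
type usageEntry struct {
	usedBytes int64
	refreshed time.Time
}

var (
	usageCache    = map[string]usageEntry{}
	refreshPeriod = 5 * time.Minute // corresponds to the refresh flag discussed below
)

// diskUsage is the "du under the hood" part: sum file sizes under path.
func diskUsage(path string) (int64, error) {
	var total int64
	err := filepath.WalkDir(path, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip entries we cannot stat, much like du -s keeps going
		}
		if !d.IsDir() {
			if info, infoErr := d.Info(); infoErr == nil {
				total += info.Size()
			}
		}
		return nil
	})
	return total, err
}

// volumeUsedBytes serves from the cache and only re-walks after refreshPeriod.
// The real driver does the walk asynchronously so the RPC returns within
// kubelet's timeout; this sketch keeps it synchronous for brevity.
func volumeUsedBytes(volID, volPath string) (int64, error) {
	if e, ok := usageCache[volID]; ok && time.Since(e.refreshed) < refreshPeriod {
		return e.usedBytes, nil
	}
	used, err := diskUsage(volPath)
	if err != nil {
		return 0, err
	}
	usageCache[volID] = usageEntry{usedBytes: used, refreshed: time.Now()}
	return used, nil
}

func main() {
	used, err := volumeUsedBytes("fs-abcd:/", "/tmp")
	fmt.Println(used, err)
}
```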

@k8s-ci-robot (Contributor)
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) on Aug 10, 2020
@k8s-ci-robot (Contributor)
Welcome @kbasv!

It looks like this is your first PR to kubernetes-sigs/aws-efs-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-efs-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)
Hi @kbasv. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) and the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Aug 10, 2020
@kbasv (Author) commented on Aug 11, 2020

I signed it

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) and removed the cncf-cla: no label on Aug 11, 2020
@wongma7 (Contributor) commented on Aug 11, 2020

/ok-to-test

@k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member as safe to test) and removed the needs-ok-to-test label on Aug 11, 2020
@wongma7 (Contributor) left a comment

Approach is sound; some minor comments and questions we can discuss.

cmd/main.go Outdated
@@ -32,6 +32,8 @@ func main() {
version = flag.Bool("version", false, "Print the version and exit")
efsUtilsCfgDirPath = flag.String("efs-utils-config-dir-path", "/etc/amazon/efs/", "The path to efs-utils config directory")
efsUtilsStaticFilesPath = flag.String("efs-utils-static-files-path", "/etc/amazon/efs-static-files/", "The path to efs-utils static files directory")
volMetricsOptIn = flag.Bool("volMetricsOptIn", false, "Opt in to emit volume metrics")
Contributor:
This should be dash-separated instead of camelCase, to be consistent with the other args.

Author:
Sure, I'll update this to match the other args.
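For illustration only, the dash-separated spelling being requested would look something like this; the final flag names in the merged PR may differ.

```go
package main

import "flag"

var (
	// Hypothetical dash-separated names matching the style of the existing flags.
	volMetricsOptIn       = flag.Bool("vol-metrics-opt-in", false, "Opt in to emit volume metrics")
	volMetricsRefreshRate = flag.Float64("vol-metrics-refresh-rate", 5, "Refresh rate for volume metrics in minutes")
)

func main() {
	flag.Parse()
	_ = *volMetricsOptIn
	_ = *volMetricsRefreshRate
}
```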

cmd/main.go Outdated
@@ -45,7 +47,8 @@ func main() {
os.Exit(0)
}

drv := driver.NewDriver(*endpoint, *efsUtilsCfgDirPath, *efsUtilsStaticFilesPath)
drv := driver.NewDriver(*endpoint, *efsUtilsCfgDirPath, *efsUtilsStaticFilesPath, *volMetricsOptIn, *volMetricsRefreshRate)
drv.SetNodeCapOptInFeatures(*volMetricsOptIn)
Contributor:
Why not hide this in NewDriver, since NewDriver takes in volMetricsOptIn anyway? Not sure of the benefit of a separate function.

@@ -187,6 +186,11 @@ func (d *Driver) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublish
return nil, status.Error(codes.InvalidArgument, "Target path not provided")
}

if d.volMetricsOptIn {
klog.V(4).Infof("Evicting vol ID: %v, vol path : %v from cache", req.VolumeId, target)
Contributor:
It's possible to publish the same VolumeId to different targets.

I don't want to complicate this too much, but do we need some kind of ref counter to account for that, so we only evict from the cache if the counter == 0?

Author:
I think a counter will be a bit complex here. Do you know if a volume ID and volume path combination is unique? If so, I can use it as the cache key.

Contributor:
It will be unique, but if we do that then we will recalculate the same volume multiple times. For example, if I have a volume fs-abcd:/root mounted to both /pod1/target and /pod2/target, the routine will calculate the disk usage of /root in fs-abcd twice, once for /pod1/target and again for /pod2/target, which is unnecessary.

Author:
I now see what you mean with the counter. Sure, we can add a counter and evict only once every target for the volumeId has been unpublished.
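A rough sketch of what the ref counting agreed on here could look like; all names below (volCache, publishCount, onPublish, onUnpublish) are hypothetical and not taken from the PR.

```go
package main

import "sync"

// Hypothetical sketch: only evict a volume's cached metrics once every target
// it was published to has been unpublished.
type volCache struct {
	mu           sync.Mutex
	publishCount map[string]int   // volume ID -> number of published targets
	usedBytes    map[string]int64 // volume ID -> cached usage
}

func newVolCache() *volCache {
	return &volCache{
		publishCount: map[string]int{},
		usedBytes:    map[string]int64{},
	}
}

func (c *volCache) onPublish(volID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.publishCount[volID]++
}

func (c *volCache) onUnpublish(volID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.publishCount[volID] > 0 {
		c.publishCount[volID]--
	}
	if c.publishCount[volID] == 0 {
		delete(c.publishCount, volID)
		delete(c.usedBytes, volID) // evict only when the last target is gone
	}
}

func main() {
	c := newVolCache()
	c.onPublish("fs-abcd:/root")   // mounted at /pod1/target
	c.onPublish("fs-abcd:/root")   // mounted at /pod2/target
	c.onUnpublish("fs-abcd:/root") // one target unmounted: keep the cache entry
	c.onUnpublish("fs-abcd:/root") // last target unmounted: evict
}
```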

}

volUsed, ok := used.AsInt64()

Contributor:
Nitpick: would like to see the whitespace/newlines tightened up, like here. There doesn't need to be a newline between the call and the error checking. gofmt may help: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_aws-efs-csi-driver/238/pull-aws-efs-csi-driver-verify/1293265088089690114#1:build-log.txt%3A278

Author:
Sure, will update this.

func (v VolStatterImpl) computeDiskUsage(volId, volPath string) {
klog.V(5).Infof("Compute Volume Metrics invoked for Vol ID: %v", volId)

used, err := fs.DiskUsage(volPath)
Contributor:
What happens if Unpublish is called on this volume in the middle of this routine? Would DiskUsage block the Unpublish unmount for the duration it's reading the volume?

@kbasv (Author), Aug 12, 2020:
The logs from my testing indicate Unpublish will not succeed while DiskUsage is running; I see "device is busy" associated with the Unpublish command in the logs. Do you think it's okay to wait it out, or should we find a way to force unpublish in the above scenario?

Contributor:
Darn. Well, I think it's okay to wait it out. We can leave it as a future improvement, because we're relying on a library (fs.DiskUsage) and there's no way to cancel it.

In the worst case, Unpublish could be delayed indefinitely if the volume metrics period < the computeDiskUsage execution time and computeDiskUsage ends up running 100% of the time.

In a less-bad case, Unpublish could be delayed by the time it takes for DiskUsage to complete + 2 minutes. The 2 minutes comes from the exponential backoff for Unpublish retries: https://github.com/kubernetes/kubernetes/blob/323f34858de18b862d43c40b2cced65ad8e24052/pkg/util/goroutinemap/exponentialbackoff/exponential_backoff.go#L33

Author:
I'll add a TODO to figure out a way to cancel the running du process in the future.

@wongma7 self-assigned this on Aug 11, 2020
cmd/main.go Outdated
@@ -32,6 +32,8 @@ func main() {
version = flag.Bool("version", false, "Print the version and exit")
efsUtilsCfgDirPath = flag.String("efs-utils-config-dir-path", "/etc/amazon/efs/", "The path to efs-utils config directory")
efsUtilsStaticFilesPath = flag.String("efs-utils-static-files-path", "/etc/amazon/efs-static-files/", "The path to efs-utils static files directory")
volMetricsOptIn = flag.Bool("volMetricsOptIn", false, "Opt in to emit volume metrics")
volMetricsRefreshRate = flag.Float64("volMetricsRefreshRate", 5, "Refresh rate for volume metrics in minutes")
Contributor:
I think this should be renamed to "period" instead of "rate".

@kbasv force-pushed the volStats branch 2 times, most recently from 7fba2b8 to 0df1a23, on August 12, 2020 at 21:33
@kbasv (Author) commented on Aug 13, 2020

/test pull-aws-efs-csi-driver-e2e

@kbasv (Author) commented on Aug 13, 2020

/test pull-aws-efs-csi-driver-e2e

@pdrakeweb

Generally, the code here makes sense to me; however, the implementation seems a bit problematic. If I understand correctly, the underlying tool (du) essentially stats every file to count the size. For EFS file systems with many files, this may dramatically consume the available IOPS. As many volumes may be created on a single underlying EFS, many executions of du may then occur in parallel, further exacerbating the performance impact. An ideal implementation would instead call an AWS API to query the disk usage of a specific directory on an EFS file system. I recognize such an API does not exist and AWS would need to create it.

@wongma7 (Contributor) commented on Aug 13, 2020

You're right on all counts. This PR has an optimization to avoid reading the same volume handle (fs-abcd:/a/b) concurrently on the same node, but there's certainly room for more. If we think the performance penalty of enabling this feature will outweigh the benefit more often than not, we can do more optimizations first. Personally, I'm inclined to merge and collect feedback, but I don't want people to enable the feature and then be surprised when their credits get exhausted, so I guess we ought to test and evaluate the impact more first.

Some brainstorming:

  • We don't account for double reading of parent/child directories (fs-abcd:/a/b/c and fs-abcd:/a/b). Maybe we could read only leaf directories.
  • We don't account for the same file system being read from different nodes, which we obviously have to, since one of the main selling points of EFS is that you can do that. This is not easy: would each DaemonSet member hold a per-file-system lock (a Lease object) in Kubernetes, would they write a file in the EFS file system itself as a lock, etc.?
  • EFS metered size has a pretty generous eventual-consistency guarantee: "the value represents the actual size only if the file system is not modified for a period longer than a couple of hours." https://docs.aws.amazon.com/efs/latest/ug/API_FileSystemSize.html I imagine most users care only about metered size and want this feature to find culprits. So it follows that the driver only needs to be about as accurate as metered size to be useful, i.e. the default 5-minute refresh is probably excessive.

@kbasv (Author) commented on Aug 14, 2020

I agree with the concerns about the underlying implementation's potential to consume all the available IOPS of a file system.
From my observation while testing this implementation on a cluster, du stats ~32k files per minute with about 2 percent I/O consumption. I think the following two options would limit the volume statter's IOPS consumption:

  • Adding a jittered start to goroutine execution. This will minimize the number of volume statter routines executing simultaneously.
  • Introducing a rate limiter to limit the number of simultaneous executions per file system.

The above two options address the concern for mounts on a per-node basis (a rough sketch of both follows this comment). However, they will not address the concern of a file system mounted on multiple nodes. For a multi-node mount, given how DaemonSets work, it is not easy to share state between nodes, and any solution for multi-node mounts seems out of scope for the current implementation.
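Neither idea appears in this PR's diff; purely as an illustration, a jittered start plus a per-file-system concurrency limit might look like the sketch below. Every name here is hypothetical.

```go
package main

import (
	"math/rand"
	"sync"
	"time"
)

// Hypothetical: cap concurrent du walks per EFS file system with a semaphore,
// and add a random start delay so statter goroutines do not all fire at once.
var (
	limiterMu sync.Mutex
	fsLimiter = map[string]chan struct{}{} // file system ID -> semaphore
	maxPerFS  = 2
	maxJitter = 2 * time.Second // a real driver would likely use a larger window
)

func acquire(fsID string) chan struct{} {
	limiterMu.Lock()
	sem, ok := fsLimiter[fsID]
	if !ok {
		sem = make(chan struct{}, maxPerFS)
		fsLimiter[fsID] = sem
	}
	limiterMu.Unlock()
	sem <- struct{}{} // blocks while maxPerFS walks are already running for this fs
	return sem
}

func statVolume(fsID, volPath string, walk func(path string)) {
	time.Sleep(time.Duration(rand.Int63n(int64(maxJitter)))) // jittered start
	sem := acquire(fsID)
	defer func() { <-sem }()
	walk(volPath)
}

func main() {
	var wg sync.WaitGroup
	for _, p := range []string{"/a", "/b", "/c"} {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			statVolume("fs-abcd", path, func(string) { time.Sleep(100 * time.Millisecond) })
		}(p)
	}
	wg.Wait()
}
```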

@pdrakeweb

A start jitter and rate limiter could be helpful. The test uncovered roughly what I would expect - that du manages to consume a large portion of the available IOPS. Without QoS across clients, this may have a significant impact on applications using the volume. This is of particular concern for larger file systems (think about volumes exceeding 10M files). If an AWS provided API for this metadata is not possible, another option would be to introduce a mechanism for limiting the IOPS consumed by du. The kernel has mechanisms (cgroups) to limit block I/O but I presume these would not apply as the underlying EFS volume is not a block device. An alternative du which includes an internal IOP throttle could be used (although it would need to be written).

Comment on lines +149 to +155
volUsageCache[volId] = volMetrics
delete(volStatterJobTracker, volId)

Reviewer:
The map updates from this function are data races since the function is always called in a goroutine. The check in launchVolStatsRoutine to see if the volId is currently being processed is not sufficient - the check could be called at the same time as the delete here. I would suggest wrapping them in a sync mechanism. Removing the time.Sleep(waitTime) and stubbing out the external calls to the fs library will expose the data races in the test cases.

Author:
I think that should be okay since launchVolStatsRoutine will be invoked by kubelet approximately every minute.

Reviewer:
I'd say if it's invoked infrequently, then the overhead of a lock would be negligible. What's the recovery path if this panics?

Author:
I see what you mean. I didn't realize Go panics on concurrent map reads and writes. The alternative would be to use either locks or sync.Map. I guess locks are better given how kubelet works?

@kerbyhughes, Aug 19, 2020:
Correct, they are not safe for concurrent use and will panic under certain conditions. I'm not overly familiar with sync.Map, but I believe it was created as an optimization for certain conditions where regular locks are not meeting your performance needs.
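For concreteness, the lock-based fix being discussed amounts to something like the following sketch. Variable names mirror the diff excerpt above, but this is not the merged code and the driver's actual fields may differ.

```go
package main

import "sync"

// Sketch: guard both maps with one mutex so the statter goroutine's writes and
// the routine-launcher's reads cannot race.
var (
	statsMu              sync.Mutex
	volUsageCache        = map[string]int64{}
	volStatterJobTracker = map[string]bool{}
)

func storeResult(volID string, usedBytes int64) {
	statsMu.Lock()
	defer statsMu.Unlock()
	volUsageCache[volID] = usedBytes
	delete(volStatterJobTracker, volID)
}

func alreadyRunning(volID string) bool {
	statsMu.Lock()
	defer statsMu.Unlock()
	return volStatterJobTracker[volID]
}

func main() {
	if !alreadyRunning("fs-abcd:/") {
		storeResult("fs-abcd:/", 42)
	}
}
```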

@kerbyhughes

@kbasv I ran your updated code and the changes fixed the data races in volStatter. The race detector still shows a race where the test code itself does not wrap the map updates in a lock, but the rest looks good, thanks.

@kbasv (Author) commented on Aug 29, 2020

> @kbasv I ran your updated code and the changes fixed the data races in volStatter. The race detector still shows a race where the test code itself does not wrap the map updates in a lock, but the rest looks good, thanks.

@kerbyhughes Thanks for checking, fixed the test code with locks.

@kbasv (Author) commented on Aug 29, 2020

/retest

3 similar comments
@kbasv (Author) commented on Aug 31, 2020

/retest

@kbasv (Author) commented on Sep 10, 2020

/retest

@kbasv (Author) commented on Sep 11, 2020

/retest

}

func (v VolStatterImpl) computeVolumeMetrics(volId, volPath string, refreshRate float64, fsRateLimit int) (*volMetrics, error) {
if value, ok := v.retrieveFromCache(volId); ok {
Contributor:
What if I have two PVs mounted on this node with the same volume ID but different target paths? Say volume A is fs-123:/a and volume B is fs-123:/b. It will be random which metric I get back.

Contributor:
Maybe not random, but one of A or B will be reported for both A and B.

However, if you index the results cache by volume ID + path but the tracker just by volume ID, there might be situations where the metric for B is never computed. Say kubelet always calls get-stats on A before B; then B will never get computed, since the computation for A was just started and is in progress...

@kbasv (Author), Sep 30, 2020:
IIRC, the volumeId above consists of the file system ID + subpath. So fs-123:/a (volume A) and fs-123:/b (volume B) will be two different volume IDs, and a metric will be computed for both A and B.

The cache ensures that the metric for volume ID fs-123:/a, mounted separately as two volumes A and B, will not be computed twice.

Contributor:
OK, makes sense; I see now that you distinguish between the fsid for the per-file-system rate limit and the volId for the metrics cache.
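To make the distinction concrete, a hypothetical helper (not from the PR) that separates the two keys could look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical helper: split a volume handle like "fs-123:/a" into the file
// system ID (used for the per-file-system rate limit) and the subpath; the
// full handle itself serves as the metrics cache key.
func splitVolumeID(volID string) (fsID, subpath string) {
	if i := strings.Index(volID, ":"); i >= 0 {
		return volID[:i], volID[i+1:]
	}
	return volID, "/"
}

func main() {
	fsID, subpath := splitVolumeID("fs-123:/a")
	fmt.Println(fsID, subpath) // fs-123 /a
}
```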

@k8s-ci-robot added and then removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Oct 2, 2020
@kbasv (Author) commented on Oct 6, 2020

@wongma7 Let me know if you have more concerns on this PR. Otherwise, can you please merge this? Thanks!

@wongma7 (Contributor) commented on Oct 7, 2020

Please squash the commits and check coveralls. The bot will not merge if coveralls says coverage decreased.

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me"; indicates that a PR is ready to be merged) on Oct 7, 2020
@k8s-ci-robot (Contributor)
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: kbasv, wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Oct 7, 2020
* Enable GET_VOLUME_STATS node capability
@k8s-ci-robot removed the lgtm label on Oct 7, 2020
@wongma7 (Contributor) commented on Oct 7, 2020

/retest

},
}
)
makeDir(validPath)
Contributor:
The test environment doesn't like this; not sure if https://golang.org/pkg/io/ioutil/#TempDir will work either.
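For reference, using the linked ioutil.TempDir in the test might look roughly like this; the function and package names below are assumptions, not the PR's actual test code.

```go
package driver

import (
	"io/ioutil"
	"os"
	"testing"
)

// Hypothetical test setup using a temp directory instead of a fixed path;
// on newer Go versions, t.TempDir() would also handle the cleanup itself.
func TestVolumeStatsUsesTempDir(t *testing.T) {
	dir, err := ioutil.TempDir("", "efs-vol-stats")
	if err != nil {
		t.Fatalf("creating temp dir: %v", err)
	}
	defer os.RemoveAll(dir)

	// ... exercise the statter against dir instead of validPath ...
	_ = dir
}
```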

@kbasv (Author) commented on Oct 7, 2020

@wongma7 Looks like the coveralls test environment's temp directory creation is flaky, and that is pulling down the coverage. When running the test locally, temp directory creation works and the overall coverage is greater than 82%. The test environment succeeded in creating the temp directory once here: https://coveralls.io/builds/32722046

@wongma7 (Contributor) commented on Oct 7, 2020

@kbasv OK, thanks for looking into it. Since the test works locally and in pull-aws-efs-csi-driver-unit (the output actually looks crazy because I see efs_watch_dog.go:236] stopping... being spammed infinitely, but that's unrelated), and technically coveralls is not supposed to block merge, I am inclined to merge this after the rest of the tests pass and leave the coveralls fix as a follow-up.

/lgtm

@k8s-ci-robot added the lgtm label on Oct 7, 2020
@kbasv (Author) commented on Oct 7, 2020

/retest

@wongma7 (Contributor) commented on Oct 7, 2020

Manually merging due to coveralls being flaky. I'll open an issue to track it.

Labels
approved · cncf-cla: yes · lgtm · ok-to-test · size/XL
5 participants