Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 495: Validate Pravega manifest settings before deployment #586

Merged
merged 6 commits into from
Nov 24, 2021

Conversation

nishant-yt
Copy link
Collaborator

@nishant-yt nishant-yt commented Oct 21, 2021

Signed-off-by: Nishant Gupta Nishant_Gupta3@dell.com

Change log description

While deploying Pravega , Segment Store settings should adhere to the following rules:
POD_MEM_LIMIT > JVM Heap (-Xmx) + JVM Direct Memory (-XX:MaxDirectMemorySize)
JVM Direct memory > Segment Store read cache size (pravegaservice.cache.size.max)

Here we are validating the above scenarios before deployment by the operator happens instead of manually relying on the user to set them right.

Purpose of the change

Fixes #495

What the code does

The code makes sure the following checks are performed by the operator before deploying Pravega:
POD_MEM_LIMIT > JVM Heap + JVM Direct Memory
JVM Direct memory > Segment Store read cache size

How to verify it

Try setting different values for .spec.pravega.segmentStoreResources.requests.memory , -Xmx , -XX:MaxDirectMemorySize and pravegaservice.cache.size.max and the operator should return error in case the above mentioned rules are not met.

@codecov-commenter
Copy link

codecov-commenter commented Oct 21, 2021

Codecov Report

Merging #586 (e515742) into master (7af939a) will increase coverage by 0.76%.
The diff coverage is 82.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #586      +/-   ##
==========================================
+ Coverage   74.17%   74.94%   +0.76%     
==========================================
  Files          16       16              
  Lines        4229     4402     +173     
==========================================
+ Hits         3137     3299     +162     
- Misses        962      972      +10     
- Partials      130      131       +1     
Impacted Files Coverage Δ
pkg/apis/pravega/v1beta1/pravega.go 98.78% <ø> (+0.44%) ⬆️
pkg/apis/pravega/v1beta1/pravegacluster_types.go 30.45% <82.75%> (+3.72%) ⬆️
pkg/apis/pravega/v1beta1/zz_generated.deepcopy.go 99.03% <0.00%> (-0.41%) ⬇️
pkg/controller/pravega/pravega_controller.go 100.00% <0.00%> (ø)
pkg/controller/pravega/pravega_segmentstore.go 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7af939a...e515742. Read the comment docs.

@@ -1194,6 +1206,66 @@ func (p *PravegaCluster) validateConfigMap() error {
return nil
}

func (p *PravegaCluster) ValidateSegmentStore() error {
totalMemoryLimitsQuantity := p.Spec.Pravega.SegmentStoreResources.Limits[corev1.ResourceMemory]
totalMemoryRequestsQuantity := p.Spec.Pravega.SegmentStoreResources.Requests[corev1.ResourceMemory]
Copy link
Contributor

@anishakj anishakj Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are you verifying the scenario, limit is set but requests are not set for resource, also in that case do we need to set request same as limits

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the request to the limit value should be done in the defaults section, if I am not mistaken.

@nishant-yt Did you verify that totalMemoryRequestsQuantity.Value() below produces a zero when the memory request is NOT set by the user?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the code in defaults section to make sure requests is equal to limits in case requests is not set by the user. Also, I have verified totalMemoryRequestsQuantity.Value() does produces 0 value if memory requests is not set by the user

Comment on lines +38 to +49
Options: map[string]string{"pravegaservice.cache.size.max": "1610612736"},
SegmentStoreJVMOptions: []string{"-Xmx1g", "-XX:MaxDirectMemorySize=2560m"},
SegmentStoreResources: &corev1.ResourceRequirements{
Requests: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("1000m"),
corev1.ResourceMemory: resource.MustParse("4Gi"),
},
Limits: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("2000m"),
corev1.ResourceMemory: resource.MustParse("4Gi"),
},
},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these values come from?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting all the defaults here without verifying user supplied values is extremely dangerous.

If the user supplies memory limit less than 4Gi (say, 2Gi) without memory request,
they will end with a 4Gi memory request (default) and 2Gi memory limit (user-supplied).

What we rather need to do while setting the defaults is:

  1. See if the user has set a memory request
  2. If the request is not set, then set it to memory limit (the admission webhook would be verifying that the memory limit is set by the user).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default values are only for e2e tests and the checks are already in place for all the scenarios

@@ -1194,6 +1206,66 @@ func (p *PravegaCluster) validateConfigMap() error {
return nil
}

func (p *PravegaCluster) ValidateSegmentStore() error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add comments in this method explaining what it does and why. Also, the name of the method should be more concrete, like ValidateSegmentStoreMemorySettings, as ValidateSegmentStore is not really descriptive about what it is doing.

maxDirectMemorySize := maxDirectMemoryQuantity.Value()
cacheSize := cacheSizeQuantity.Value()

if totalMemoryLimits <= maxDirectMemorySize+xmx {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please, leave a space after and before the + symbol (it is hard to read otherwise).

return fmt.Errorf("MaxDirectMemorySize(%v B) along with JVM Xmx value(%v B) is greater than or equal to the total available memory(%v B)!", maxDirectMemorySize, xmx, totalMemoryLimits)
}

if maxDirectMemorySize <= xmx {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, remove this part of the validation (also from the PR description), as this may be completely legitimate. You may want to configure a JVM of 4GB with a cache of 3GB (and therefore setting maxDirectMemorySize=4GB). And there is nothing strictly wrong with that, is just a user choice. We have to validate to prevent what we know for sure is wrong, but not more than that, as we also want to give some degree of flexibility to users.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this document, we should generally provision the Segment Store with more direct memory than heap memory. So, shouldn't this validation be there. I agree with your example and can remove the equality from the check so as to provide flexibility to the user.

@anishakj
Copy link
Contributor

anishakj commented Nov 9, 2021

@nishant-yt Could you please add end to end tests to cover the webhook validations?

// JVM Direct memory > Segment Store read cache size (pravegaservice.cache.size.max).
func (p *PravegaCluster) ValidateSegmentStoreMemorySettings() error {
if p.Spec.Pravega.SegmentStoreResources == nil {
return fmt.Errorf("Missing required value for field spec.pravega.segmentStoreResources.limits.memory")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RaulGracia Should we allow deployments without setting the resources?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the offline discussion, we do need a check for p.Spec.Pravega.SegmentStoreResources not to be nil.
If we allow the nil object here, skip the below validation and let the defaults be assigned later by withDefaults() function, we cannot perform a check for user-supplied pravegaservice.cache.size.max value or JVM memory settings. Therefore it would make sense to make p.Spec.Pravega.SegmentStoreResources a required block.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for usability in the first usage, allowing a cluster to be deployed with default values makes sense, but it would not be at all a production-ready configuration. And of course, not at the cost of not doing the memory checks.

Comment on lines 331 to 448
if s.SegmentStoreResources.Limits[v1.ResourceCPU] == (resource.Quantity{}) {
changed = true
s.SegmentStoreResources.Limits[v1.ResourceCPU] = resource.MustParse(DefaultSegmentStoreLimitCPU)
}

if s.SegmentStoreResources.Requests[v1.ResourceCPU] == (resource.Quantity{}) {
changed = true
s.SegmentStoreResources.Requests[v1.ResourceCPU] = resource.MustParse(DefaultSegmentStoreRequestCPU)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is a good idea.
If the user sets the CPU request and is not aware of the default limit, we can create a situation where the limit is below the request.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should always ask for the memory limit, while the request may be optional. In case of setting only limit, request can be set to limit, as this seems to be the behavior of Kubernetes: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RaulGracia What about CPU?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I think that CPU is less of a problem, it may be good also bound the CPU resources similarly to what is being done here for memory. In this case, we should i) enforce that at least limit on CPU is set (to prevent a busy or malfunctioning pod starving all other pods in the node), ii) if requests is not set, default it to limit, and iii) if requests is set, check that it is <= limit.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RaulGracia The code I'm commenting on is located in withDefaults() function and is invoked after the CR has passed the validation webhook. Therefore there are no checks or enforcement are done in this area of the code.

My concern is that when the user creates a CR with high CPU request and no CPU limit, like this (memory settings aside):

segmentStore:
  resources:
    requests:
      cpu: 2000m

then line 333 in the current code will transform it into the following, which is wrong:

  resources:
    requests:
      cpu: 2000m
    limits:
      cpu: 1

Again, if the user creates a CR with low CPU limit and no CPU request, like this:

segmentStore:
  resources:
    limits:
      cpu: 200m

then line 338 of the current code will again transform it into a wrong configuration:

segmentStore:
  resources:
    requests:
      cpu: 500m
    limits:
      cpu: 200m 

My ask is to not assign the CPU request/limit defaults independently of each other (i.e., of what the user might have set for the other parameter).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkhalack actually I will be modifying the withDefaults() method as shown below so as to have the same check what we have for memory in addition to the checks what @RaulGracia mentioned :

if s.SegmentStoreResources.Requests[v1.ResourceCPU] == (resource.Quantity{}) {
		changed = true
		s.SegmentStoreResources.Requests[v1.ResourceCPU] = s.SegmentStoreResources.Limits[v1.ResourceCPU]
	}

return fmt.Errorf("Missing required value for field spec.pravega.segmentStoreResources.limits.memory")
}

totalMemoryLimitsQuantity := p.Spec.Pravega.SegmentStoreResources.Limits[corev1.ResourceMemory]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a check that p.Spec.Pravega.SegmentStoreResources.Limits != nil and that p.Spec.Pravega.SegmentStoreResources.Requests != nil before we go ahead and check for memory settings (the motivation being that the user might set requests without limits or limits without requests).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that limits should not be null. We could leave requests nullable, and if so, set requests=limits as it seems to be the underlying behavior in Kubernetes itself.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There still needs to be a check for a nil Requests object before we try to assign Requests[corev1.ResourceMemory] to anything (like we currently do it on the next line)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed , I will be adding a nil check for both Requests and Limits Objects.

Comment on lines 1223 to 1228
if p.Spec.Pravega.SegmentStoreResources.Requests == nil {
return fmt.Errorf("spec.pravega.segmentStoreResources.requests cannot be empty")
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR changes in pravega.go do allow for the requests not being set.
The better way to deal with the situation is to set totalMemoryRequests to zero if segmentStoreResources.requests is not set (otherwise proceed normally).

Copy link

@jkhalack jkhalack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@RaulGracia RaulGracia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one comment.

@@ -323,6 +323,24 @@ func (s *PravegaSpec) withDefaults() (changed bool) {
}
}

if s.SegmentStoreResources.Requests == nil {
changed = true
s.SegmentStoreResources.Requests = map[corev1.ResourceName]resource.Quantity{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure at this point in the code that s.SegmentStoreResources.Limits exists?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes because in ValidateSegmentStoreMemorySettings method we are first checking whether p.Spec.Pravega.SegmentStoreResources.Limits == nil and in turn making sure user does pass Limits

"autoScale.tokenSigningKey": "secret",
"pravega.client.auth.token": "YWRtaW46MTExMV9hYWFh",
"pravega.client.auth.method": "Basic",
"pravegaservice.container.count": "4",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, this was still with the old nomenclature,

Signed-off-by: Nishant Gupta <Nishant_Gupta3@dell.com>
Signed-off-by: Nishant Gupta <Nishant_Gupta3@dell.com>
Signed-off-by: Nishant Gupta <Nishant_Gupta3@dell.com>
Signed-off-by: Nishant Gupta <Nishant_Gupta3@dell.com>
Signed-off-by: Nishant Gupta <Nishant_Gupta3@dell.com>
Signed-off-by: Nishant Gupta <Nishant_Gupta3@dell.com>
@nishant-yt nishant-yt force-pushed the issue-495-validate-ss-controller-settings branch from 0ef2886 to 6783fe3 Compare November 24, 2021 07:38
Copy link
Contributor

@anishakj anishakj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@anishakj anishakj merged commit 076c05e into master Nov 24, 2021
@nishant-yt nishant-yt deleted the issue-495-validate-ss-controller-settings branch November 24, 2021 10:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validate Pravega manifest settings before deployment.
5 participants