
Added alternative expression field in AWS CloudWatch scaler #2911

Closed

Conversation

@dekelev (Contributor) commented Apr 15, 2022

Added an alternative expression field to the AWS CloudWatch scaler.
Here's an example ScaledObject config:

awsRegion: us-east-1
expression: >-
  SELECT MIN(MessageCount) FROM "AWS/AmazonMQ" WHERE Broker =
  'production' and Queue = 'worker'
metricCollectionTime: '60'
metricName: WorkerMessageCount
metricStatPeriod: '60'
minMetricValue: '0'
namespace: AWS/AmazonMQ
targetMetricValue: '100'
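
For illustration, here is a minimal sketch (not part of this PR) of how such an expression can be evaluated against CloudWatch's GetMetricData API using the AWS SDK for Go v1; the package name, function name, region, and query id are assumptions for the example only, not the scaler's actual code:

package cloudwatchsketch

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// QueryExpression runs a CloudWatch Metrics Insights expression (like the
// SELECT statement above) over the given window and returns the first
// datapoint of the result.
func QueryExpression(expr string, period int64, window time.Duration) (float64, error) {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	cw := cloudwatch.New(sess)

	end := time.Now()
	out, err := cw.GetMetricData(&cloudwatch.GetMetricDataInput{
		StartTime: aws.Time(end.Add(-window)),
		EndTime:   aws.Time(end),
		MetricDataQueries: []*cloudwatch.MetricDataQuery{{
			Id:         aws.String("q1"),
			Expression: aws.String(expr), // e.g. the SELECT query from the config above
			Period:     aws.Int64(period),
			ReturnData: aws.Bool(true),
		}},
	})
	if err != nil {
		return 0, err
	}
	if len(out.MetricDataResults) == 0 || len(out.MetricDataResults[0].Values) == 0 {
		return 0, fmt.Errorf("no datapoints returned for expression")
	}
	return *out.MetricDataResults[0].Values[0], nil
}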

In addition, the AWS CloudWatch scaler was changed to use int64 instead of float64 for the metric type, since the HPA doesn't work with float numbers, only integers.

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)
  • Tests have been added
  • A PR is opened to update our Helm chart (repo) (if applicable, i.e. when deployment manifests are modified)
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Changelog has been updated and is aligned with our changelog requirements

Fixes #

@JorTurFer (Member) left a comment

Looks good in general (thanks for this feature). I have left one inline comment (and it is probably the reason the tests are failing).
Apart from the inline comment, there are some things that need to be done to make this PR mergeable.

@@ -288,19 +281,27 @@ func (c *awsCloudwatchScaler) GetMetrics(ctx context.Context, metricName string,

	metric := external_metrics.ExternalMetricValue{
		MetricName: metricName,
-		Value:      *resource.NewQuantity(int64(metricValue), resource.DecimalSI),
+		Value:      *resource.NewQuantity(metricValue*100, resource.DecimalSI),
@JorTurFer (Member) commented

Why are you doing this? Multiplying the value by 100 changes the current behaviour, doesn't it?
I'd say that when using an expression, this multiplier can be applied inside the expression itself, WDYT?

@dekelev (Contributor, Author) commented Apr 18, 2022

@JorTurFer While debugging the scaler, I noticed that values are returned as whole numbers from the CloudWatch lib and match the value I get when running the query directly in the CloudWatch UI. After passing the metric to the HPA, the k8s stats show that the value is not interpreted correctly, and indeed autoscaling wasn't scaling up as expected, but multiplying it by 100 solved the issue. I haven't tested this scaler with a query builder, but I find it hard to believe that the HPA would somehow work fine with the same value just because it was fetched with a query builder as opposed to an expression.

If we just want to avoid a breaking change at all costs, then I'll move the multiplication into the expression logic, right after the value is fetched from the CloudWatch lib (I tried embedding it directly in the query, but the query just fails to run).

@JorTurFer (Member) commented

I get your point about the scaler behaviour: it's a fact that by casting float64 into int64 we lose information, and multiplying by 100 reduces what is lost, but there are use cases where that loss is not a problem.
Apart from that, IMO we should keep the value exposed by the upstream as it is given. This new feature gives users the option of multiplying the value as part of the expression, doesn't it?

@dekelev (Contributor, Author) commented Apr 18, 2022

I had to multiply by 100 even when the scaler received an integer value from CloudWatch (e.g. values: ['12345'], something like that). The query I'm testing with actually gets the max value, so it returns integers, since it's the maximum number of messages that were in the queue over a period of one minute.

@JorTurFer (Member) commented

Do you mean that you are getting 123.45 when you should receive 12345?

@JorTurFer (Member) commented Apr 21, 2022

Yes, and that's what I tried to explain. The value that you see in the HPA is the result of the formula (returnedValue/podCount). You cannot compare it with the upstream value directly; you need to apply the same formula.

You are doing:

  • CloudWatch -> 40
  • KEDA -> 4k (40 * 100)
  • HPA -> 80 (4000/50)

This is wrong because the expected values are:

  • CloudWatch -> 40
  • KEDA -> 40
  • HPA -> 800m (40/50) *Remember that in k8s decimal numbers don't exist; 0.8 is translated into milli scale, so 0.8 -> 800m

You shouldn't multiply the upstream value by 100 because the behaviour changes.
When I said to multiply inside the expression, I thought that your case involved rates between 0 and 1, where you'd need it because of the information lost when casting to int64.
If that were your case, multiplying by 100 in the expression and taking that into account in the threshold could solve the problem, but that's not your case.
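
To make the milli-scale arithmetic above concrete, here is a short illustrative Go snippet (not from the PR) showing how the two per-pod averages would be represented with the Kubernetes resource package; the package and function names are made up for the example:

package quantitysketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// ShowAverages prints the per-pod averages from the example above:
// 40 messages spread across 50 pods, with and without the *100 multiplier.
func ShowAverages() {
	// Kubernetes quantities have no floats; 40/50 = 0.8 is stored in
	// milli scale as 800m.
	expected := resource.NewMilliQuantity(800, resource.DecimalSI)
	fmt.Println(expected.String()) // 800m

	// Multiplying the upstream value by 100 first gives 4000/50 = 80,
	// which no longer matches the value reported by CloudWatch.
	inflated := resource.NewQuantity(80, resource.DecimalSI)
	fmt.Println(inflated.String()) // 80
}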

@dekelev (Contributor, Author) commented Apr 21, 2022

I see, so what does targetAverageValue actually refer to? How can I set it correctly? Is 800m what I'm comparing against now?

In my example it's 200, which seems to correspond to HPA -> 80 (4000/50).
So if the HPA shows 800m (40/50), should it be 2 (2000m)? I think I tried that before and it worked. I just couldn't understand the numbers, and it somehow made more sense after multiplying the metric by 100.

Anyway, I think I have everything I need to complete the PR. Thanks a lot for taking the time to explain this!

@JorTurFer (Member) commented

targetAverageValue is the desired value per pod. Imagine that your workload can process one message at a time and you want to process all messages as fast as possible; in that case you could set targetAverageValue to 1, to have one pod per message.
In the same scenario, if your pod can process 2 messages at a time (and you want to process them as fast as possible with as few resources as possible), targetAverageValue would be 2.

If 10 messages are pending, with targetAverageValue 1 you will have 10 pods (10/1=10) and with 2 you will have 5 pods (10/2=5).
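
As a side note, that arithmetic follows the HPA's AverageValue rule, roughly desiredReplicas = ceil(totalMetricValue / targetAverageValue); a tiny illustrative sketch (not part of the PR, names are hypothetical):

package hpasketch

import "math"

// DesiredReplicas approximates the HPA AverageValue calculation:
// ceil(totalMetricValue / targetAverageValue).
func DesiredReplicas(totalMetricValue, targetAverageValue float64) int32 {
	return int32(math.Ceil(totalMetricValue / targetAverageValue))
}

// Using the numbers above:
//   DesiredReplicas(10, 1) == 10 // 10 pending messages, 1 per pod
//   DesiredReplicas(10, 2) == 5  // 10 pending messages, 2 per pod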

No worries about the time spent, I'm happy to help 😄

Let me know if you have any doubts about how it works.

@dekelev (Contributor, Author) commented Apr 21, 2022

Awesome, thanks. Now it makes sense. I'm hoping to get the tests & docs changes done in the next few days. Have a great day!

@JorTurFer (Member) commented

Sure, no rush at all.
Have a great day too!

@dekelev (Contributor, Author) commented Apr 18, 2022

I see that the scaler has a Go test file (aws_cloudwatch_scaler_test.go), but the contributing testing guide here refers to testing with TypeScript.

Is there any guide on how to test a single scaler Go file? Is JS used for e2e tests while Go is used for unit tests?
I tried testing it with go test -v [file_path], but it throws errors.

@JorTurFer (Member) commented Apr 18, 2022

> I see that the scaler has a Go test file (aws_cloudwatch_scaler_test.go), but the contributing testing guide [here](https://github.com/dekelev/keda/blob/main/CONTRIBUTING.md?rgh-link-date=2022-04-18T09%3A34%3A20Z#testing) refers to testing with TypeScript.
>
> Is there any guide on how to test a single scaler Go file? Is JS used for e2e tests while Go is used for unit tests?
> I tried testing it with go test -v [file_path], but it throws errors.

You are right, that doc refers to using TypeScript, but for e2e tests. Right now, e2e tests are written in TS and unit tests are written in Go. We are working on unifying this, but for the moment that's the workflow.

Regarding running the tests from that file alone, AFAIK in Go you cannot specify a single file, but you can specify the names of the tests you want to execute, so you can run them with:

go test -run '^(TestCloudwatchParseMetadata|TestAWSCloudwatchGetMetricSpecForScaling|TestAWSCloudwatchScalerGetMetrics|TestComputeQueryWindow)$' github.com/kedacore/keda/v2/pkg/scalers

@dekelev (Contributor, Author) commented May 4, 2022

@JorTurFer I've added tests, signed the commit, and synced my main branch with keda. After that, I noticed that the PR got closed; I'm not sure why. It looks like I did it somehow. Can you please re-open it?

@JorTurFer (Member) commented May 4, 2022

I cannot, because the branch was deleted after this force push.
Maybe you did it by mistake :/
The good news is that the work is not lost, because we still have the old state available. Maybe you can create another PR from another branch, just porting the changes (to ensure that we don't lose the work; TBH IDK how GH manages it).

@dekelev (Contributor, Author) commented May 4, 2022

@JorTurFer OK, thanks. I've created PR #2997 to resume this. Can you please verify that commits are signed as expected and review the changes to the tests? I'll PR the docs change later.

@JorTurFer (Member) commented

Yes, we can continue the discussion there 👍
