Failing ES Promotion: alerting api integration security and spaces enabled Alerts update no_kibana_privileges at space1 should handle updates to an alert schedule by rescheduling the underlying task #71558

Closed
tylersmalley opened this issue Jul 13, 2020 · 7 comments · Fixed by #71632
Labels
blocker failed-es-promotion Feature:Alerting skipped-test Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.9.0 v8.0.0

@tylersmalley
Contributor

This failure is preventing the promotion of the current Elasticsearch nightly snapshot.

main/8.0.0: https://kibana-ci.elastic.co/job/elasticsearch+snapshots+verify/1077/execution/node/381/log/
7.x/7.9: https://kibana-ci.elastic.co/job/elasticsearch+snapshots+verify/1076/execution/node/386/log/

For more information on the Elasticsearch snapshot promotion process: https://www.elastic.co/guide/en/kibana/master/development-es-snapshots.html

13:32:17       │1)    alerting api integration security and spaces enabled
13:32:17       │       Alerts
13:32:17       │         update
13:32:17       │           no_kibana_privileges at space1
13:32:17       │             should handle updates to an alert schedule by rescheduling the underlying task:
13:32:17       │
13:32:17       │      retry.try timeout: Error: expected 180059 to be above 1790000
13:32:17       │     at Assertion.assert (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:100:11)
13:32:17       │     at Assertion.greaterThan.Assertion.above (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:317:8)
13:32:17       │     at Function.greaterThan (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:531:15)
13:32:17       │     at ensureDatetimeIsWithinRange (/dev/shm/workspace/kibana/x-pack/test/alerting_api_integration/common/lib/test_assertions.ts:15:22)
13:32:17       │     at retry.try (/dev/shm/workspace/kibana/x-pack/test/alerting_api_integration/security_and_spaces/tests/alerting/update.ts:430:13)
13:32:17       │   Error: retry.try timeout: Error: expected 180059 to be above 1790000
13:32:17       │       at Assertion.assert (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:100:11)
13:32:17       │       at Assertion.greaterThan.Assertion.above (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:317:8)
13:32:17       │       at Function.greaterThan (/dev/shm/workspace/kibana/packages/kbn-expect/expect.js:531:15)
13:32:17       │       at ensureDatetimeIsWithinRange (test/alerting_api_integration/common/lib/test_assertions.ts:15:22)
13:32:17       │       at retry.try (test/alerting_api_integration/security_and_spaces/tests/alerting/update.ts:430:13)
13:32:17       │       at onFailure (/dev/shm/workspace/kibana/test/common/services/retry/retry_for_success.ts:28:9)
13:32:17       │       at retryForSuccess (/dev/shm/workspace/kibana/test/common/services/retry/retry_for_success.ts:68:13)
@tylersmalley tylersmalley added blocker Feature:Alerting v8.0.0 Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.9.0 failed-es-promotion labels Jul 13, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

tylersmalley pushed a commit that referenced this issue Jul 13, 2020
#71558

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
tylersmalley pushed a commit that referenced this issue Jul 13, 2020
#71558

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
@tylersmalley
Contributor Author

Skipped:

main/8.0: 439f2dd
7.x/7.9: a5f773e

@tylersmalley
Contributor Author

Possibly related to #71559 and/or #71555

@tylersmalley
Contributor Author

Confirmed that this failure, along with those linked above, was introduced in elastic/elasticsearch@a28ce1e

@pmuellr
Member

pmuellr commented Jul 14, 2020

Here's the code for this failing test:

it('should handle updates to an alert schedule by rescheduling the underlying task', async () => {
  const { body: createdAlert } = await supertest
    .post(`${getUrlPrefix(space.id)}/api/alerts/alert`)
    .set('kbn-xsrf', 'foo')
    .send(
      getTestAlertData({
        schedule: { interval: '30m' },
      })
    )
    .expect(200);
  objectRemover.add(space.id, createdAlert.id, 'alert', 'alerts');
  await retry.try(async () => {
    const alertTask = (await getAlertingTaskById(createdAlert.scheduledTaskId)).docs[0];
    expect(alertTask.status).to.eql('idle');
    // ensure the alert initial run has completed and it's been rescheduled to half an hour from now
    ensureDatetimeIsWithinRange(Date.parse(alertTask.runAt), 30 * 60 * 1000);
  });

Here's the ensureDatetimeIsWithinRange() function:

export function ensureDatetimeIsWithinRange(
  date: number,
  expectedDiff: number,
  buffer: number = 10000
) {
  const diff = date - Date.now();
  expect(diff).to.be.greaterThan(expectedDiff - buffer);
  expect(diff).to.be.lessThan(expectedDiff + buffer);
}

At first glance, I'm pretty confused by the error message. It's helpful to add commas to the numbers in question:

retry.try timeout: Error: expected 180,059 to be above 1,790,000

and for completeness:

30 * 60 * 1000 = 1,800,000

At first I didn't realize the two numbers were off by a factor of 10, so I thought we could just change that buffer default, since 1790... is so close to 1800.... Welp.

So it would be perfectly explainable if the '30m' interval somehow got interpreted as 3 minutes instead of 30, but it seems like that would break LOTS of stuff. How would a change to ES cause that?
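
To make the arithmetic above concrete, here is a minimal sketch that recomputes the bounds ensureDatetimeIsWithinRange() checks against, using the numbers from the failure message. The helper rangeBounds and the constant names are hypothetical, introduced only for illustration; they are not part of the Kibana test code.

```typescript
// Hypothetical helper (not in the Kibana source): returns the exclusive bounds
// that ensureDatetimeIsWithinRange() compares `date - Date.now()` against.
function rangeBounds(
  expectedDiff: number,
  buffer: number = 10000
): { lower: number; upper: number } {
  return { lower: expectedDiff - buffer, upper: expectedDiff + buffer };
}

const expectedDiffMs = 30 * 60 * 1000; // the '30m' schedule interval, in ms
const observedDiffMs = 180059; // value copied from the failure message

const { lower, upper } = rangeBounds(expectedDiffMs);
const observedMinutes = observedDiffMs / 60000;

console.log(lower, upper); // 1790000 1810000
console.log(observedDiffMs > lower); // false, which is the failing assertion
console.log(observedMinutes.toFixed(2)); // 3.00
```

So the observed runAt is about 3 minutes out rather than 30, and widening the buffer default would not come close to making the assertion pass.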

@pmuellr
Member

pmuellr commented Jul 14, 2020

Repro with the following:

Remove the .skip from

# FTS
$ cd x-pack
$ KBN_ES_SNAPSHOT_USE_UNVERIFIED=1 node scripts/functional_tests_server.js --config=test/alerting_api_integration/security_and_spaces/config.ts
...

# FTR
$ cd x-pack
$ node ../scripts/functional_test_runner.js --grep "Alerts.update" --config=test/alerting_api_integration/security_and_spaces/config.ts
...

The env var prefix on the FTS command is supposed to make it use the latest ES. I'm seeing this error in the FTS log, which suggests it really is the latest ES, the one with the issue around unnamed granted API keys:

info [o.e.x.s.a.AuthenticationService] [pmuellr.muellerware.org] Authentication using apikey failed - apikey authentication for id bglrS3MBS3WJPfRCpyTJ encountered a failure
     org.elasticsearch.common.xcontent.XContentParseException: [1:532] [api_key_doc] name doesn't support values of type: VALUE_NULL

Here's where it chokes on the test:

   └-: Alerts
     └-> "before all" hook
     └-: update
       └-> "before all" hook
       └-: no_kibana_privileges at space1
         └-> "before all" hook
         ...
         └-> should handle updates to an alert schedule by rescheduling the underlying task
           └-> "before each" hook: global before each
           │ debg --- retry.try error: expected 'claiming' to sort of equal 'idle'
           │ debg --- retry.try error: expected 'running' to sort of equal 'idle'
           │ debg --- retry.try error: expected 298973 to be above 1790000
           │ debg --- retry.try error: expected 298411 to be above 1790000
           ...
           │ debg --- retry.try error: expected 180984 to be above 1790000
           │ debg --- retry.try error: expected 180437 to be above 1790000

It appears the "close, but off by 10x" for the numbers 180437 and 1790000 is a happy decimally coincidence!

@pmuellr
Member

pmuellr commented Jul 14, 2020

That's as far as I got, but here's a guess:

The API key failure is leaving the alert in a weird state where it's doing an error retry, hence the smaller diff in the runAt date. The initial runAt diff reported above is about 5 minutes; is that our error retry interval?

$ node -p "298973 / 1000 / 60"
4.9828833333333336
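
The same conversion can be applied to every diff in the retry log above. The sketch below is illustrative only: the sample values are copied from the log, and the variable names are assumptions. It also computes how much wall-clock time passed between the largest and smallest logged diffs, assuming runAt did not change between attempts.

```typescript
// Diffs (runAt - Date.now(), in ms) copied from the retry.try log lines above.
const loggedDiffsMs: number[] = [298973, 298411, 180984, 180437];

const toMinutes = (ms: number): number => ms / 1000 / 60;

for (const diff of loggedDiffsMs) {
  console.log(`${diff} ms = ${toMinutes(diff).toFixed(2)} min`);
}

// The logged diffs only decrease, so max/min correspond to the first and last
// attempts. Their difference is the elapsed wall-clock time, under the
// assumption that runAt stayed fixed while Date.now() advanced.
const elapsedMs = Math.max(...loggedDiffsMs) - Math.min(...loggedDiffsMs);
console.log(`elapsed: ${(elapsedMs / 1000).toFixed(1)} s`);
```

The first diff works out to about 4.98 minutes and the last to about 3.01, roughly 118.5 seconds apart. That would be consistent with a fixed runAt about five minutes after the first check and a retry that gives up after about two minutes, though the thread does not state the retry timeout.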
