-
Notifications
You must be signed in to change notification settings - Fork 724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elastic APM Server: Set status.ObservedGeneration from metadata.Generation #5470
Conversation
Adjust reconciler to always attempt a status update, even on error. Add generation check to e2e APMServer tests.
Update tests for new behavior.
…dgeneration increment. Debugging e2e tests
run full pr build |
Fyi, e2e tests are failing with new 7-8x upgrade test dealing with mutations associated with these changes, and some additional issues associated with missing roles. If things continue to fail, I'll change the mutation test to not upgrade, but just adjust some labels or something to speed this up. |
…er has needed permissions before attempting to ingest events.
run full pr build |
Remove unneeded context from test
run/e2e-tests |
run/e2e-tests |
jenkins test this please |
run/e2e-tests |
run/e2e-tests |
I can't reproduce e2e test failure for |
run/e2e-tests |
// It corresponds to the metadata generation, which is updated on mutation by the API Server. | ||
// If the generation observed in status diverges from the generation in metadata, the APM Server | ||
// controller has not yet processed the changes contained in the APM Server specification. | ||
// +kubebuilder:validation:Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I ever noticed it, but there are 2 ways to specify a field as optional: +kubebuilder:validation:Optional
and +optional
. Only the former seems to be documented.
// non-existing indexes (ES >= 8.x). This fixes a transient issue that happens when upgrading | ||
// Elasticsearch between major versions where it takes a bit of time to transition between | ||
// file-based user roles, and permissions errors were being returned by Elasticsearch. | ||
func (c *apmClusterChecks) CheckIndexCreation(apm apmv1.ApmServer, k *test.K8sClient) test.Step { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is still failing when run locally (TestAPMServerVersionUpgradeToLatest8x
from 7.17.1
to 8.1.1
)
Looking at the APM Server logs I can see 2 security related errors at the beginning, then I think the test is able to create the events, but it still eventually fails.
{"log.level":"error","@timestamp":"2022-04-28T08:03:44.099Z","log.logger":"modelindexer","log.origin":{"file.name":"modelindexer/indexer.go","file.line":395},"message":"failed to index event (security_exception): action [indices:admin/auto_create] is unauthorized for user [e2e-mercury-apmserver-upgrade-49wz-apm-user] with roles [apm_system,eck_apm_user_role_v75,ingest_admin] on indices [logs-apm.error-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]","service.name":"apm-server","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-04-28T08:03:44.099Z","log.logger":"modelindexer","log.origin":{"file.name":"modelindexer/indexer.go","file.line":395},"message":"failed to index event (security_exception): action [indices:admin/auto_create] is unauthorized for user [e2e-mercury-apmserver-upgrade-49wz-apm-user] with roles [apm_system,eck_apm_user_role_v75,ingest_admin] on indices [metrics-apm.app.1234_service_12a3-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]","service.name":"apm-server","ecs.version":"1.6.0"}
=== RUN TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01
Retries (5m0s timeout): .
step.go:43:
Error Trace: utils.go:88
Error: Received unexpected error:
timeout reached after 5m0s
Test: TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01
{"log.level":"error","@timestamp":"2022-04-28T10:08:40.998+0200","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:44\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutations\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:39\ngithub.com/elastic/cloud-on-k8s/test/e2e/apm.TestAPMServerVersionUpgradeToLatest8x\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/test/e2e/apm/upgrade_test.go:32\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1439"}
--- FAIL: TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01 (300.00s)
I'm still trying to understand if I'm doing something wrong...
metricIndex = "metrics-apm.app.1234_service_12a3-default" | ||
errorIndex = "logs-apm.error-default" | ||
} | ||
return retry.UntilSuccess(func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to understand why we are retrying during 5 minutes? What external event would unlock this situation? My understanding is that if assertCountIndexEqual
fails, it is likely because checkEventsAPI
(c.apmClient.IntakeV2Events
) has also failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully this update will show what's going on in some detail.
If I move the retry around this block and add some logging, we can see what failures we're working with
return retry.UntilSuccess(func() error {
if err := c.checkEventsAPI(apm); err != nil {
fmt.Printf("checkEventsApi failure: %s\n", err)
return err
}
if err := c.checkEventsInElasticsearch(apm, k); err != nil {
fmt.Printf("checkEventsInElasticsearch failure: %s\n", err)
return err
}
return nil
}, 5*time.Minute, 10*time.Second)
With this change, here's the result. It fails before even attempting the version upgrade.
1st attempt: The messages are sent to apmserver successfully, but they haven't made it all the way to ES yet, as the metric index hasn't even been created yet.
2nd attempt: Before we attempt to send another message, we count the messages in the index (metric) and find 1 (which likely is the first message we attempted to send. Messages sent successfully to apmserver again, but we are expecting 2 messages in the index, but only find one (again).
3rd attempt: This process keeps repeating.
Seems like we aren't waiting long enough for the message to get from apm -> ES.
EventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-rxzf-es-internal-http.e2e-mercury.svc:9200/apm-7.17.1-metric-2017.05.30/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [apm-7.17.1-metric-2017.05.30] Type:index_not_found_exception RootCause:[{Reason:no such index [apm-7.17.1-metric-2017.05.30] Type:index_not_found_exception}]}}\n"}
{"Time":"2022-04-28T10:44:56.167184-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 2 documents expected in index apm-7.17.1-metric-2017.05.30, got 1 instead\n"}
{"Time":"2022-04-28T10:45:06.47493-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 3 documents expected in index apm-7.17.1-metric-2017.05.30, got 2 instead\n"}
{"Time":"2022-04-28T10:45:16.885698-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 4 documents expected in index apm-7.17.1-metric-2017.05.30, got 3 instead\n"}
{"Time":"2022-04-28T10:45:27.190003-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 5 documents expected in index apm-7.17.1-metric-2017.05.30, got 4 instead\n"}
{"Time":"2022-04-28T10:45:37.492295-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 6 documents expected in index apm-7.17.1-metric-2017.05.30, got 5 instead\n"}
{"Time":"2022-04-28T10:45:47.799507-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 7 documents expected in index apm-7.17.1-metric-2017.05.30, got 6 instead\n"}
{"Time":"2022-04-28T10:45:58.102207-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 8 documents expected in index apm-7.17.1-metric-2017.05.30, got 7 instead\n"}
{"Time":"2022-04-28T10:46:08.396236-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 9 documents expected in index apm-7.17.1-metric-2017.05.30, got 8 instead\n"}
If I change things, and retry the counting of the indexes after dropping messages in apmserver, things succeed again.
if err := c.checkEventsAPI(apm); err != nil {
fmt.Printf("checkEventsApi failure: %s\n", err)
return err
}
return retry.UntilSuccess(func() error {
if err := c.checkEventsInElasticsearch(apm, k); err != nil {
fmt.Printf("checkEventsInElasticsearch failure: %s\n", err)
return err
}
return nil
}, 5*time.Minute, 10*time.Second)
Errors during this run
Before version upgrade we are missing the index, but the retry after 10 seconds succeeds.
{"Time":"2022-04-28T11:56:34.904163-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"Retries (30m0s timeout): .checkEventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-hdgx-es-internal-http.e2e-mercury.svc:9200/apm-7.17.1-metric-2017.05.30/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [apm-7.17.1-metric-2017.05.30] Type:index_not_found_exception RootCause:[{Reason:no such index [apm-7.17.1-metric-2017.05.30] Type:index_not_found_exception}]}}\n"}
After the version upgrade, we get some failures, but eventually succeed
1: index initially not found
{"Time":"2022-04-28T12:05:02.499518-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01","Output":"Retries (30m0s timeout): .checkEventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-hdgx-es-internal-http.e2e-mercury.svc:9200/metrics-apm.app.1234_service_12a3-default/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception RootCause:[{Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception}]}}\n"}
This repeats from 2022-04-28T 12:05:02.499518-05:00
until 2022-04-28T 12:10:06.010821-05:00
(about 5 mins, which is the time for ES to re-read it's file-based users/roles IIRC) when it finally succeeds.
Attempt 3
Sleep for 10 seconds in between inserting message(s), and checking to see if they made it into ES.
return retry.UntilSuccess(func() error {
if err := c.checkEventsAPI(apm); err != nil {
fmt.Printf("checkEventsApi failure: %s\n", err)
return err
}
time.Sleep(10*time.Second)
if err := c.checkEventsInElasticsearch(apm, k); err != nil {
fmt.Printf("checkEventsInElasticsearch failure: %s\n", err)
return err
}
return nil
}, 5*time.Minute, 10*time.Second)
Before version upgrade, the first attempt to insert + read messages through apmserver succeeds without failure.
After version upgrade, we get 2x index not found failures, and then it succeeds within < 1 minute (did we just get lucky with timing that it didn't take 5 mins?)
{"Time":"2022-04-28T13:37:28.722225-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01","Output":"Retries (30m0s timeout): .checkEventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-nkvk-es-internal-http.e2e-mercury.svc:9200/metrics-apm.app.1234_service_12a3-default/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception RootCause:[{Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception}]}}\n"}
{"Time":"2022-04-28T13:37:49.022055-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01","Output":"checkEventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-nkvk-es-internal-http.e2e-mercury.svc:9200/metrics-apm.app.1234_service_12a3-default/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception RootCause:[{Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception}]}}\n"}
{"Time":"2022-04-28T13:38:09.523725-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01","Output":"\n"}
So from what I see, we have 2 options:
- Retry the attempt to find the messages in ES for a bit until we fail/succeed.
- Sleep a bit between initial count+insert, and the subsequent count to see if the message(s) arrived successfully.
@pebrc @barkbay Please let me know what you think, and which direction you'd prefer me take this, and I'll get it implemented, and we can finish off this PR.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added some debug logs to make it clear what is happening in my case and with your original proposal (516df86):
=== RUN TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01
Retries (5m0s timeout):
04-29-2022 13:43:54 - [DEBUG] Entering checkEventsAPI(...)
04-29-2022 13:43:56 - [DEBUG] State before injecting events: c.metricIndexCount=0 c.errorIndexCount=0
04-29-2022 13:43:56 - [DEBUG] Injecting event using c.apmClient.IntakeV2Events
04-29-2022 13:43:56 - [DEBUG] Leaving checkEventsAPI without error, c.apmClient.IntakeV2Events seems to be successful
04-29-2022 13:43:56 - [DEBUG] Entering checkEventsInElasticsearch(...)
04-29-2022 13:43:56 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 13:43:56 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found: no such index [metrics-apm.app.1234_service_12a3-default]
.... 5 minutes later, still the same error ...
04-29-2022 13:48:48 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 13:48:48 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found: no such index [metrics-apm.app.1234_service_12a3-default]
step.go:43:
Error Trace: utils.go:88
Error: Received unexpected error:
timeout reached after 5m0s
The index is never created, checkEventsAPI
is never retried.
I revisited the original retry policy (based on 516df86) to give a chance to checkEventsAPI(...)
to be retried:
--- a/test/e2e/test/apmserver/checks_apm.go
+++ b/test/e2e/test/apmserver/checks_apm.go
@@ -246,7 +246,9 @@ func (c *apmClusterChecks) checkEventsInElasticsearch(apm apmv1.ApmServer, k *te
}
return assertCountIndexEqual(c.esClient, errorIndex, c.errorIndexCount+1)
- }, 5*time.Minute, 10*time.Second)
+ // We need some retries here as the events might not have been indexed in Elasticsearch.
+ // timeout must be less than the parent retry policy, but long enough to let events flow to Elasticsearch.
+ }, 30*time.Second, 2*time.Second)
}
With that change applied it works after 3 calls to checkEventsAPI()
:
# First attempt: KO with `404`, even after the updated timeout of `30*time.Second` index is not created
Retries (5m0s timeout): .04-29-2022 12:57:21 - [DEBUG] Entering checkEventsAPI(...)
04-29-2022 12:57:23 - [DEBUG] State before injecting events: c.metricIndexCount=0 c.errorIndexCount=0
04-29-2022 12:57:23 - [DEBUG] Injecting event using c.apmClient.IntakeV2Events
04-29-2022 12:57:23 - [DEBUG] Leaving checkEventsAPI without error, c.apmClient.IntakeV2Events seems to be successful
04-29-2022 12:57:23 - [DEBUG] Entering checkEventsInElasticsearch(...)
04-29-2022 12:57:23 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:57:23 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]
04-29-2022 12:57:25 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
[ ... last error x 12 ... ]
04-29-2022 12:57:52 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:57:52 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]
# Second attempt: same problem, KO with `404`
.04-29-2022 12:57:56 - [DEBUG] Entering checkEventsAPI(...)
04-29-2022 12:57:56 - [DEBUG] State before injecting events: c.metricIndexCount=0 c.errorIndexCount=0
04-29-2022 12:57:56 - [DEBUG] Injecting event using c.apmClient.IntakeV2Events
04-29-2022 12:57:56 - [DEBUG] Leaving checkEventsAPI without error, c.apmClient.IntakeV2Events seems to be successful
04-29-2022 12:57:56 - [DEBUG] Entering checkEventsInElasticsearch(...)
04-29-2022 12:57:56 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:57:56 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]
04-29-2022 12:57:58 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
[ last error x 12]
04-29-2022 12:58:25 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:58:25 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]
# Third attempt: OK after 1 `404`
04-29-2022 12:58:29 - [DEBUG] Entering checkEventsAPI(...)
04-29-2022 12:58:29 - [DEBUG] State before injecting events: c.metricIndexCount=0 c.errorIndexCount=0
04-29-2022 12:58:29 - [DEBUG] Injecting event using c.apmClient.IntakeV2Events
04-29-2022 12:58:30 - [DEBUG] Leaving checkEventsAPI without error, c.apmClient.IntakeV2Events seems to be successful
04-29-2022 12:58:30 - [DEBUG] Entering checkEventsInElasticsearch(...)
04-29-2022 12:58:30 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:58:30 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]
04-29-2022 12:58:32 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:58:32 - [DEBUG] 1 documents expected in index metrics-apm.app.1234_service_12a3-default, got 0 instead
04-29-2022 12:58:34 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:58:34 - [DEBUG] leaving assertCountIndexEqual without error
04-29-2022 12:58:34 - [DEBUG] entering assertCountIndexEqual for index logs-apm.error-default, expecting 1 events
04-29-2022 12:58:34 - [DEBUG] leaving assertCountIndexEqual without error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So from what I see, we have 2 options:
A third option would be to say that despite our efforts it is very hard to state that an APM upgrade is successful, we could also use an annotation in the podTemplate
to trigger a restart and validate that observedGeneration
is updated as expected.
about 5 mins, which is the time for ES to re-read it's file-based users/roles IIRC
I would be really surprised if ES only reloads these files every 5 minutes (that would not match my experience).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about 5 mins, which is the time for ES to re-read it's file-based users/roles IIRC
I would be really surprised if ES only reloads these files every 5 minutes (that would not match my experience).
@barkbay https://www.elastic.co/guide/en/elasticsearch/reference/current/file-realm.html#file-realm
bottom of page, every 5 seconds
. Well, I was quite the order of magnitude off...
I'm going to update this with these changes
}, 30*time.Second, 2*time.Second)
as you're suggesting, and run some additional tests.
Put back tracing error captures.
run/e2e-tests |
run/e2e-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I tested TestAPMServerVersionUpgradeToLatest8x
twice, a first time to upgrade to 8.1.1
, second time to upgrade to 8.2.0
, and it worked as expected 👍
…ation (elastic#5470) * apm observed generation changes. Adjust reconciler to always attempt a status update, even on error. Add generation check to e2e APMServer tests. * Ensure apm status is updated when associations are broken. Update tests for new behavior. * Capitalization issues * Adjust e2e tests for apmserver to support mutations and check observedgeneration increment. Debugging e2e tests * Debugging e2e test failures * remove todo * nolint validation function * remove unused erroring fake k8s client * remove fakeK8sClient from apm testing * Fix spelling * wip * Avoid mutating the original apmserver's configuration during e2e tests * Add step to Apm Server CheckStackTestSteps to ensure ES APM Server user has needed permissions before attempting to ingest events. * Add description to CheckIndexCreation step. * Lint issues * Cleanup of some unneeded logging messages and comments. * remove duplicative recorder, dynamicwatches, and parameters * wip * Remove confusing defer bits from apmserver reconciliation * Upcase Kubernetes Remove unneeded context from test * move associations check inside doReconcile * Wrap erorrs, don't log in e2e test. Use NewElasticsearchClientWithUser to get client, don't create new one manually. * Move CheckIndexCreation closer to where it's used. * Use implemented deepcopy instead of copying the internal map in apm builder use returned results, instead of newly named results * remove unneeded log import/var * Move index variable closer to where it's used. Change how variables are initialized in e2e test. * Add without integration check piece to original builder. * Allow counting of documents in index during upgrade to function properly. * just return the assert * Adding apm tracing back to particular error. Removing request from doReconcile, and NewState, as it's unused. Using t.Helper Adding additional comments to the common Driver Interface. Renaming e2e check function. Fix misspelling. * remove unused result from state * just leveraging eventually for the apm server upgrade etest. Optimizing apm server upgrade test. * just return func * remove retry.UntilSuccess in CheckEventsInElasticsearch e2e test * Move retry.UntilSuccess around the assert calls only. * Add type check to satisfy linter. * Add WithoutIntegrationCheck to smoke tests for apm server builder * Latest released version, not latest snapshot * If no ES ref exists, just skip the step where you test for ES data in index. * Add WithoutIntegrationCheck to smoke tests for apm server builder (again) * Move retry logic to include checkEventsAPI * Expicitly return empty result. Put back tracing error captures. * minimize the total time, and retry time. * Adjust retry to just retry the check in ES.
related to #3392
related to #5331
This change allows status.ObservedGeneration to be kept in sync with metadata.Generation according to the rules described in the above issue. This change adds a set of unit tests which show more details into when observedGeneration will, and will not be updated according to any errors that may occur during reconciliation. End to end tests were adjusted to always check for observedGeneration increments during any mutation changes.