Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elastic APM Server: Set status.ObservedGeneration from metadata.Generation #5470

Merged
merged 47 commits into from
May 5, 2022

Conversation

naemono
Copy link
Contributor

@naemono naemono commented Mar 14, 2022

related to #3392
related to #5331

This change allows status.ObservedGeneration to be kept in sync with metadata.Generation according to the rules described in the above issue. This change adds a set of unit tests which show more details into when observedGeneration will, and will not be updated according to any errors that may occur during reconciliation. End to end tests were adjusted to always check for observedGeneration increments during any mutation changes.

naemono added 3 commits March 11, 2022 14:08
Adjust reconciler to always attempt a status update, even on error.
Add generation check to e2e APMServer tests.
@botelastic botelastic bot added the triage label Mar 14, 2022
@naemono naemono added >enhancement Enhancement of existing functionality >feature Adds or discusses adding a feature to the product labels Mar 14, 2022
@botelastic botelastic bot removed the triage label Mar 14, 2022
@naemono
Copy link
Contributor Author

naemono commented Mar 14, 2022

run full pr build

@naemono
Copy link
Contributor Author

naemono commented Mar 15, 2022

Fyi, e2e tests are failing with new 7-8x upgrade test dealing with mutations associated with these changes, and some additional issues associated with missing roles. If things continue to fail, I'll change the mutation test to not upgrade, but just adjust some labels or something to speed this up.

@naemono
Copy link
Contributor Author

naemono commented Mar 22, 2022

run full pr build

@naemono naemono marked this pull request as ready for review March 22, 2022 16:19
pkg/controller/apmserver/controller.go Outdated Show resolved Hide resolved
pkg/controller/apmserver/controller.go Outdated Show resolved Hide resolved
pkg/controller/apmserver/controller_test.go Outdated Show resolved Hide resolved
test/e2e/test/apmserver/checks_apm.go Outdated Show resolved Hide resolved
@naemono
Copy link
Contributor Author

naemono commented Apr 20, 2022

run/e2e-tests

@naemono
Copy link
Contributor Author

naemono commented Apr 20, 2022

run/e2e-tests

@thbkrkr thbkrkr added the v2.3.0 label Apr 21, 2022
@naemono
Copy link
Contributor Author

naemono commented Apr 24, 2022

jenkins test this please

@naemono
Copy link
Contributor Author

naemono commented Apr 25, 2022

run/e2e-tests

@naemono
Copy link
Contributor Author

naemono commented Apr 26, 2022

run/e2e-tests

@naemono
Copy link
Contributor Author

naemono commented Apr 27, 2022

I can't reproduce e2e test failure for TestMutationAndReversal/* locally, or when running e2e test fully within a test cluster, and this test has nothing to do with this PR, so I'm going to assume this is a transient failure, and just re-run the test with 🤞 ...

@naemono
Copy link
Contributor Author

naemono commented Apr 27, 2022

run/e2e-tests

// It corresponds to the metadata generation, which is updated on mutation by the API Server.
// If the generation observed in status diverges from the generation in metadata, the APM Server
// controller has not yet processed the changes contained in the APM Server specification.
// +kubebuilder:validation:Optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I ever noticed it, but there are 2 ways to specify a field as optional: +kubebuilder:validation:Optional and +optional. Only the former seems to be documented.

pkg/controller/apmserver/controller.go Outdated Show resolved Hide resolved
pkg/controller/apmserver/controller.go Show resolved Hide resolved
// non-existing indexes (ES >= 8.x). This fixes a transient issue that happens when upgrading
// Elasticsearch between major versions where it takes a bit of time to transition between
// file-based user roles, and permissions errors were being returned by Elasticsearch.
func (c *apmClusterChecks) CheckIndexCreation(apm apmv1.ApmServer, k *test.K8sClient) test.Step {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is still failing when run locally (TestAPMServerVersionUpgradeToLatest8x from 7.17.1 to 8.1.1)
Looking at the APM Server logs I can see 2 security related errors at the beginning, then I think the test is able to create the events, but it still eventually fails.

{"log.level":"error","@timestamp":"2022-04-28T08:03:44.099Z","log.logger":"modelindexer","log.origin":{"file.name":"modelindexer/indexer.go","file.line":395},"message":"failed to index event (security_exception): action [indices:admin/auto_create] is unauthorized for user [e2e-mercury-apmserver-upgrade-49wz-apm-user] with roles [apm_system,eck_apm_user_role_v75,ingest_admin] on indices [logs-apm.error-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]","service.name":"apm-server","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-04-28T08:03:44.099Z","log.logger":"modelindexer","log.origin":{"file.name":"modelindexer/indexer.go","file.line":395},"message":"failed to index event (security_exception): action [indices:admin/auto_create] is unauthorized for user [e2e-mercury-apmserver-upgrade-49wz-apm-user] with roles [apm_system,eck_apm_user_role_v75,ingest_admin] on indices [metrics-apm.app.1234_service_12a3-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]","service.name":"apm-server","ecs.version":"1.6.0"}
=== RUN   TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01
Retries (5m0s timeout): .
    step.go:43: 
        	Error Trace:	utils.go:88
        	Error:      	Received unexpected error:
        	            	timeout reached after 5m0s
        	Test:       	TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01
{"log.level":"error","@timestamp":"2022-04-28T10:08:40.998+0200","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:44\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutations\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:39\ngithub.com/elastic/cloud-on-k8s/test/e2e/apm.TestAPMServerVersionUpgradeToLatest8x\n\t/Users/michael/go/src/github.com/elastic/cloud-on-k8s/test/e2e/apm/upgrade_test.go:32\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1439"}
    --- FAIL: TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01 (300.00s)

I'm still trying to understand if I'm doing something wrong...

metricIndex = "metrics-apm.app.1234_service_12a3-default"
errorIndex = "logs-apm.error-default"
}
return retry.UntilSuccess(func() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand why we are retrying during 5 minutes? What external event would unlock this situation? My understanding is that if assertCountIndexEqual fails, it is likely because checkEventsAPI (c.apmClient.IntakeV2Events) has also failed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this update will show what's going on in some detail.

If I move the retry around this block and add some logging, we can see what failures we're working with

			return retry.UntilSuccess(func() error {
				if err := c.checkEventsAPI(apm); err != nil {
					fmt.Printf("checkEventsApi failure: %s\n", err)
					return err
				}
				if err := c.checkEventsInElasticsearch(apm, k); err != nil {
					fmt.Printf("checkEventsInElasticsearch failure: %s\n", err)
					return err
				}
				return nil
			}, 5*time.Minute, 10*time.Second)

With this change, here's the result. It fails before even attempting the version upgrade.

1st attempt: The messages are sent to apmserver successfully, but they haven't made it all the way to ES yet, as the metric index hasn't even been created yet.
2nd attempt: Before we attempt to send another message, we count the messages in the index (metric) and find 1 (which likely is the first message we attempted to send. Messages sent successfully to apmserver again, but we are expecting 2 messages in the index, but only find one (again).
3rd attempt: This process keeps repeating.

Seems like we aren't waiting long enough for the message to get from apm -> ES.

EventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-rxzf-es-internal-http.e2e-mercury.svc:9200/apm-7.17.1-metric-2017.05.30/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [apm-7.17.1-metric-2017.05.30] Type:index_not_found_exception RootCause:[{Reason:no such index [apm-7.17.1-metric-2017.05.30] Type:index_not_found_exception}]}}\n"}
{"Time":"2022-04-28T10:44:56.167184-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 2 documents expected in index apm-7.17.1-metric-2017.05.30, got 1 instead\n"}
{"Time":"2022-04-28T10:45:06.47493-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 3 documents expected in index apm-7.17.1-metric-2017.05.30, got 2 instead\n"}
{"Time":"2022-04-28T10:45:16.885698-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 4 documents expected in index apm-7.17.1-metric-2017.05.30, got 3 instead\n"}
{"Time":"2022-04-28T10:45:27.190003-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 5 documents expected in index apm-7.17.1-metric-2017.05.30, got 4 instead\n"}
{"Time":"2022-04-28T10:45:37.492295-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 6 documents expected in index apm-7.17.1-metric-2017.05.30, got 5 instead\n"}
{"Time":"2022-04-28T10:45:47.799507-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 7 documents expected in index apm-7.17.1-metric-2017.05.30, got 6 instead\n"}
{"Time":"2022-04-28T10:45:58.102207-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 8 documents expected in index apm-7.17.1-metric-2017.05.30, got 7 instead\n"}
{"Time":"2022-04-28T10:46:08.396236-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"checkEventsInElasticsearch failure: 9 documents expected in index apm-7.17.1-metric-2017.05.30, got 8 instead\n"}

If I change things, and retry the counting of the indexes after dropping messages in apmserver, things succeed again.

			if err := c.checkEventsAPI(apm); err != nil {
				fmt.Printf("checkEventsApi failure: %s\n", err)
				return err
			}
			return retry.UntilSuccess(func() error {
				if err := c.checkEventsInElasticsearch(apm, k); err != nil {
					fmt.Printf("checkEventsInElasticsearch failure: %s\n", err)
					return err
				}
				return nil
			}, 5*time.Minute, 10*time.Second)

Errors during this run

Before version upgrade we are missing the index, but the retry after 10 seconds succeeds.

{"Time":"2022-04-28T11:56:34.904163-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch","Output":"Retries (30m0s timeout): .checkEventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-hdgx-es-internal-http.e2e-mercury.svc:9200/apm-7.17.1-metric-2017.05.30/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [apm-7.17.1-metric-2017.05.30] Type:index_not_found_exception RootCause:[{Reason:no such index [apm-7.17.1-metric-2017.05.30] Type:index_not_found_exception}]}}\n"}

After the version upgrade, we get some failures, but eventually succeed

1: index initially not found

{"Time":"2022-04-28T12:05:02.499518-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01","Output":"Retries (30m0s timeout): .checkEventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-hdgx-es-internal-http.e2e-mercury.svc:9200/metrics-apm.app.1234_service_12a3-default/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception RootCause:[{Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception}]}}\n"}

This repeats from 2022-04-28T 12:05:02.499518-05:00 until 2022-04-28T 12:10:06.010821-05:00 (about 5 mins, which is the time for ES to re-read it's file-based users/roles IIRC) when it finally succeeds.

Attempt 3

Sleep for 10 seconds in between inserting message(s), and checking to see if they made it into ES.

			return retry.UntilSuccess(func() error {
				if err := c.checkEventsAPI(apm); err != nil {
					fmt.Printf("checkEventsApi failure: %s\n", err)
					return err
				}
				time.Sleep(10*time.Second)
				if err := c.checkEventsInElasticsearch(apm, k); err != nil {
					fmt.Printf("checkEventsInElasticsearch failure: %s\n", err)
					return err
				}
				return nil
			}, 5*time.Minute, 10*time.Second)

Before version upgrade, the first attempt to insert + read messages through apmserver succeeds without failure.

After version upgrade, we get 2x index not found failures, and then it succeeds within < 1 minute (did we just get lucky with timing that it didn't take 5 mins?)

{"Time":"2022-04-28T13:37:28.722225-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01","Output":"Retries (30m0s timeout): .checkEventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-nkvk-es-internal-http.e2e-mercury.svc:9200/metrics-apm.app.1234_service_12a3-default/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception RootCause:[{Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception}]}}\n"}
{"Time":"2022-04-28T13:37:49.022055-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01","Output":"checkEventsInElasticsearch failure: elasticsearch client failed for https://apmserver-upgrade-nkvk-es-internal-http.e2e-mercury.svc:9200/metrics-apm.app.1234_service_12a3-default/_count: 404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception RootCause:[{Reason:no such index [metrics-apm.app.1234_service_12a3-default] Type:index_not_found_exception}]}}\n"}
{"Time":"2022-04-28T13:38:09.523725-05:00","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/apm","Test":"TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01","Output":"\n"}

So from what I see, we have 2 options:

  1. Retry the attempt to find the messages in ES for a bit until we fail/succeed.
  2. Sleep a bit between initial count+insert, and the subsequent count to see if the message(s) arrived successfully.

@pebrc @barkbay Please let me know what you think, and which direction you'd prefer me take this, and I'll get it implemented, and we can finish off this PR.

Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added some debug logs to make it clear what is happening in my case and with your original proposal (516df86):

=== RUN   TestAPMServerVersionUpgradeToLatest8x/ApmServer_should_accept_event_and_write_data_to_Elasticsearch#01
Retries (5m0s timeout):
04-29-2022 13:43:54 - [DEBUG] Entering checkEventsAPI(...)
04-29-2022 13:43:56 - [DEBUG] State before injecting events: c.metricIndexCount=0 c.errorIndexCount=0
04-29-2022 13:43:56 - [DEBUG] Injecting event using c.apmClient.IntakeV2Events
04-29-2022 13:43:56 - [DEBUG] Leaving checkEventsAPI without error, c.apmClient.IntakeV2Events seems to be successful
04-29-2022 13:43:56 - [DEBUG] Entering checkEventsInElasticsearch(...)
04-29-2022 13:43:56 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 13:43:56 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found: no such index [metrics-apm.app.1234_service_12a3-default]
.... 5 minutes later, still the same error ...
04-29-2022 13:48:48 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 13:48:48 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found: no such index [metrics-apm.app.1234_service_12a3-default]

    step.go:43: 
        	Error Trace:	utils.go:88
        	Error:      	Received unexpected error:
        	            	timeout reached after 5m0s

The index is never created, checkEventsAPI is never retried.
I revisited the original retry policy (based on 516df86) to give a chance to checkEventsAPI(...) to be retried:

--- a/test/e2e/test/apmserver/checks_apm.go
+++ b/test/e2e/test/apmserver/checks_apm.go
@@ -246,7 +246,9 @@ func (c *apmClusterChecks) checkEventsInElasticsearch(apm apmv1.ApmServer, k *te
                }
 
                return assertCountIndexEqual(c.esClient, errorIndex, c.errorIndexCount+1)
-       }, 5*time.Minute, 10*time.Second)
+               // We need some retries here as the events might not have been indexed in Elasticsearch.
+               // timeout must be less than the parent retry policy, but long enough to let events flow to Elasticsearch.
+       }, 30*time.Second, 2*time.Second)
 }

With that change applied it works after 3 calls to checkEventsAPI():

# First attempt: KO with `404`, even after the updated timeout of `30*time.Second` index is not created
Retries (5m0s timeout): .04-29-2022 12:57:21 - [DEBUG] Entering checkEventsAPI(...)
04-29-2022 12:57:23 - [DEBUG] State before injecting events: c.metricIndexCount=0 c.errorIndexCount=0
04-29-2022 12:57:23 - [DEBUG] Injecting event using c.apmClient.IntakeV2Events
04-29-2022 12:57:23 - [DEBUG] Leaving checkEventsAPI without error, c.apmClient.IntakeV2Events seems to be successful
04-29-2022 12:57:23 - [DEBUG] Entering checkEventsInElasticsearch(...)
04-29-2022 12:57:23 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:57:23 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]
04-29-2022 12:57:25 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
[ ... last error x 12 ... ]
04-29-2022 12:57:52 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:57:52 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]

# Second attempt: same problem, KO with `404`
.04-29-2022 12:57:56 - [DEBUG] Entering checkEventsAPI(...)
04-29-2022 12:57:56 - [DEBUG] State before injecting events: c.metricIndexCount=0 c.errorIndexCount=0
04-29-2022 12:57:56 - [DEBUG] Injecting event using c.apmClient.IntakeV2Events
04-29-2022 12:57:56 - [DEBUG] Leaving checkEventsAPI without error, c.apmClient.IntakeV2Events seems to be successful
04-29-2022 12:57:56 - [DEBUG] Entering checkEventsInElasticsearch(...)
04-29-2022 12:57:56 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:57:56 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]
04-29-2022 12:57:58 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
[ last error x 12]
04-29-2022 12:58:25 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:58:25 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]

# Third attempt: OK after 1 `404`
04-29-2022 12:58:29 - [DEBUG] Entering checkEventsAPI(...)
04-29-2022 12:58:29 - [DEBUG] State before injecting events: c.metricIndexCount=0 c.errorIndexCount=0
04-29-2022 12:58:29 - [DEBUG] Injecting event using c.apmClient.IntakeV2Events
04-29-2022 12:58:30 - [DEBUG] Leaving checkEventsAPI without error, c.apmClient.IntakeV2Events seems to be successful
04-29-2022 12:58:30 - [DEBUG] Entering checkEventsInElasticsearch(...)
04-29-2022 12:58:30 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:58:30 - [ERROR] leaving assertCountIndexEqual with error: 404 Not Found:no such index [metrics-apm.app.1234_service_12a3-default]
04-29-2022 12:58:32 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:58:32 - [DEBUG] 1 documents expected in index metrics-apm.app.1234_service_12a3-default, got 0 instead
04-29-2022 12:58:34 - [DEBUG] entering assertCountIndexEqual for index metrics-apm.app.1234_service_12a3-default, expecting 1 events
04-29-2022 12:58:34 - [DEBUG] leaving assertCountIndexEqual without error
04-29-2022 12:58:34 - [DEBUG] entering assertCountIndexEqual for index logs-apm.error-default, expecting 1 events
04-29-2022 12:58:34 - [DEBUG] leaving assertCountIndexEqual without error

Copy link
Contributor

@barkbay barkbay Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So from what I see, we have 2 options:

A third option would be to say that despite our efforts it is very hard to state that an APM upgrade is successful, we could also use an annotation in the podTemplate to trigger a restart and validate that observedGeneration is updated as expected.

about 5 mins, which is the time for ES to re-read it's file-based users/roles IIRC

I would be really surprised if ES only reloads these files every 5 minutes (that would not match my experience).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about 5 mins, which is the time for ES to re-read it's file-based users/roles IIRC

I would be really surprised if ES only reloads these files every 5 minutes (that would not match my experience).

@barkbay https://www.elastic.co/guide/en/elasticsearch/reference/current/file-realm.html#file-realm
bottom of page, every 5 seconds. Well, I was quite the order of magnitude off...

I'm going to update this with these changes

}, 30*time.Second, 2*time.Second)

as you're suggesting, and run some additional tests.

@naemono
Copy link
Contributor Author

naemono commented May 4, 2022

run/e2e-tests

@naemono
Copy link
Contributor Author

naemono commented May 4, 2022

run/e2e-tests

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I tested TestAPMServerVersionUpgradeToLatest8x twice, a first time to upgrade to 8.1.1, second time to upgrade to 8.2.0, and it worked as expected 👍

@naemono naemono merged commit 216e4bf into elastic:main May 5, 2022
@naemono naemono deleted the 3392-apm-observedGeneration branch May 5, 2022 13:35
@barkbay barkbay removed the >feature Adds or discusses adding a feature to the product label Jun 13, 2022
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
…ation (elastic#5470)

* apm observed generation changes.
Adjust reconciler to always attempt a status update, even on error.
Add generation check to e2e APMServer tests.

* Ensure apm status is updated when associations are broken.
Update tests for new behavior.

* Capitalization issues

* Adjust e2e tests for apmserver to support mutations and check observedgeneration increment.
Debugging e2e tests

* Debugging e2e test failures

* remove todo

* nolint validation function

* remove unused erroring fake k8s client

* remove fakeK8sClient from apm testing

* Fix spelling

* wip

* Avoid mutating the original apmserver's configuration during e2e tests

* Add step to Apm Server CheckStackTestSteps to ensure ES APM Server user has needed
  permissions before attempting to ingest events.

* Add description to CheckIndexCreation step.

* Lint issues

* Cleanup of some unneeded logging messages and comments.

* remove duplicative recorder, dynamicwatches, and parameters

* wip

* Remove confusing defer bits from apmserver reconciliation

* Upcase Kubernetes
Remove unneeded context from test

* move associations check inside doReconcile

* Wrap erorrs, don't log in e2e test.
Use NewElasticsearchClientWithUser to get client, don't create new one manually.

* Move CheckIndexCreation closer to where it's used.

* Use implemented deepcopy instead of copying the internal map in apm builder
use returned results, instead of newly named results

* remove unneeded log import/var

* Move index variable closer to where it's used.
Change how variables are initialized in e2e test.

* Add without integration check piece to original builder.

* Allow counting of documents in index during upgrade to function properly.

* just return the assert

* Adding apm tracing back to particular error.
Removing request from doReconcile, and NewState, as it's unused.
Using t.Helper
Adding additional comments to the common Driver Interface.
Renaming e2e check function.
Fix misspelling.

* remove unused result from state

* just leveraging eventually for the apm server upgrade etest.
Optimizing apm server upgrade test.

* just return func

* remove retry.UntilSuccess in CheckEventsInElasticsearch e2e test

* Move retry.UntilSuccess around the assert calls only.

* Add type check to satisfy linter.

* Add WithoutIntegrationCheck to smoke tests for apm server builder

* Latest released version, not latest snapshot

* If no ES ref exists, just skip the step where you test for ES data in index.

* Add WithoutIntegrationCheck to smoke tests for apm server builder (again)

* Move retry logic to include checkEventsAPI

* Expicitly return empty result.
Put back tracing error captures.

* minimize the total time, and retry time.

* Adjust retry to just retry the check in ES.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement Enhancement of existing functionality v2.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants