
Run system tests in parallel #1909

Merged
merged 46 commits into elastic:main from system-tests-parallel on Jun 19, 2024
Conversation


@mrodm mrodm commented Jun 13, 2024

Depends on elastic/package-spec#759
Part of #787

Run system tests in parallel:

  • Created new methods to trigger tests in parallel (using goroutines); a sketch of the pattern follows this list.
  • Added new global test configuration files to define whether a package's tests can run in parallel, or whether all tests of a given test type should be skipped.
    • By default, tests run sequentially.
  • Added environment variables to define independent agents and how many goroutines can be triggered in parallel.
    • Updated the nginx and apache packages to run system tests in parallel.
  • Reordered when the shutdownAgentHandler is set: now it is set just after setting up the Agent.
    • Required to keep the same order while agents could be created in the servicedeployer package (custom agents or k8s agents).
  • Changed how the shutdown service handler is set, so that it is set even if there is an error. Reverted.
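
The goroutine-based runner boils down to a bounded-parallelism pattern: start one goroutine per test, but gate them on a semaphore sized by the maximum number of parallel tests. A minimal, self-contained sketch of that pattern (runInParallel, runTestCase and their signatures are hypothetical, not this PR's actual API):

package main

import (
	"fmt"
	"sync"
)

// runTestCase stands in for running a single system test.
func runTestCase(name string) error {
	fmt.Println("running", name)
	return nil
}

// runInParallel runs tests with at most maxParallel goroutines at once,
// using a buffered channel as a counting semaphore.
func runInParallel(tests []string, maxParallel int) []error {
	sem := make(chan struct{}, maxParallel)
	errs := make([]error, len(tests))
	var wg sync.WaitGroup
	for i, name := range tests {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			errs[i] = runTestCase(name)
		}(i, name)
	}
	wg.Wait()
	return errs
}

func main() {
	for _, err := range runInParallel([]string{"a", "b", "c", "d"}, 3) {
		if err != nil {
			fmt.Println("error:", err)
		}
	}
}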

Checklist

How to test

  1. Add a global test configuration file to a package:

system:
  parallel: true

  2. Run the system tests:

elastic-package stack up -v -d

cd test/packages/parallel/nginx/

ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS=3 ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT=true elastic-package test system -v

cd test/packages/parallel/apache/

ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS=3 ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT=true elastic-package test system -v

  3. Bring the stack down:

elastic-package stack down -v

@mrodm mrodm self-assigned this Jun 13, 2024

partial, err := r.runTest(ctx, testConfig, svcInfo)

tdErr := r.tearDownTest(ctx)
if err != nil {
-	return partial, err
+	return partial, errors.Join(err, fmt.Errorf("failed to tear down runner: %w", tdErr))
mrodm (Contributor Author):

Not able yet to show the teardown error when the context is cancelled (by pressing Ctrl+C).

As it was before, the "deleting test policies" handler was failing, but the error was not raised:

^C2024/06/13 19:32:38 DEBUG deleting test policies...
2024/06/13 19:32:38  INFO Signal caught!
2024/06/13 19:32:38 DEBUG POST https://127.0.0.1:5601/api/fleet/agent_policies/delete
2024/06/13 19:32:39 DEBUG POST https://127.0.0.1:5601/api/fleet/agent_policies/delete
2024/06/13 19:32:39 DEBUG Failed deleting test policy handler...: error cleaning up test policy: could not delete policy; API status code = 400; response body = {"statusCode":400,"error":"Bad Request","message":"Cannot delete an agent policy that is assigned to any active or inactive agents"}
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 DEBUG Uninstalling package...
2024/06/13 19:32:39 DEBUG GET https://127.0.0.1:5601/api/fleet/epm/packages/nginx
2024/06/13 19:32:40 DEBUG DELETE https://127.0.0.1:5601/api/fleet/epm/packages/nginx/999.999.999
interrupted
Command exited with non-zero status 130

jsoriano (Member):

Errors caused by interruptions are intentionally not shown; they tend to be long chains of wrapped errors that are not very useful.

We capture interruptions here:

elastic-package/main.go

Lines 27 to 30 in ebb1446

if errIsInterruption(err) {
	rootCmd.Println("interrupted")
	os.Exit(130)
}

If both the test and the tear down fail, I think we are fine with returning only the test error.
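
For reference, an interruption check like errIsInterruption can be implemented by unwrapping the error chain and looking for a context cancellation; a minimal sketch, which may differ from the actual implementation in main.go:

package main

import (
	"context"
	"errors"
	"fmt"
)

// errIsInterruption (hypothetical sketch): treat any error that wraps a
// context cancellation as an interruption. The real check may cover more cases.
func errIsInterruption(err error) bool {
	return errors.Is(err, context.Canceled)
}

func main() {
	err := fmt.Errorf("tearing down runner: %w", context.Canceled)
	fmt.Println(errIsInterruption(err)) // true: cancellation found in the chain
}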

mrodm (Contributor Author):

Ok, understood. I'll remove this errors.Join then.

Thanks!

Comment on lines +250 to +266
r.dataStreamManifest, err = packages.ReadDataStreamManifest(filepath.Join(r.dataStreamPath, packages.DataStreamManifestFile))
if err != nil {
	return nil, fmt.Errorf("reading data stream manifest failed: %w", err)
}

// Temporarily, until independent Elastic Agents are enabled by default,
// enable independent Elastic Agents if the package defines that it requires root privileges.
if pkg, ds := r.pkgManifest, r.dataStreamManifest; pkg.Agent.Privileges.Root || (ds != nil && ds.Agent.Privileges.Root) {
	r.runIndependentElasticAgent = true
}

// If the environment variable is present, it always takes precedence over the root
// privileges value (if any) defined in the manifest file.
v, ok := os.LookupEnv(enableIndependentAgents)
if ok {
	r.runIndependentElasticAgent = strings.ToLower(v) == "true"
}
mrodm (Contributor Author):

Added this to the constructor so it can be queried whether or not the runner is using independent Elastic Agents.
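
Since the environment variable takes precedence, independent Elastic Agents can be forced on or off regardless of the manifest. For example, to disable them even for a package whose manifest requires root privileges (hypothetical invocation, reusing the variable from this PR):

ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT=false elastic-package test system -v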

if err != nil {
	return nil, err
}

scenario.agent = agentDeployed

service, svcInfo, err := r.setupService(ctx, config, serviceOptions, svcInfo, agentInfo, agentDeployed, policy, serviceStateData)
mrodm (Contributor Author):

Delay setting up the service until after the handlers related to the Agent are set (a sketch of why this ordering matters follows the list):

  • shutdownAgentHandler (this is set within setupAgent method)
  • removeAgentHandler
  • resetAgentPolicyHandler
  • resetAgentLogLevelHandler
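
A minimal sketch of why this registration order matters (illustrative types only, not the PR's actual ones): once the agent's cleanup handler is assigned, a later failure while setting up the service still leaves the agent teardown in place.

package main

import (
	"context"
	"fmt"
)

type resource struct{ name string }

func (r *resource) Shutdown(ctx context.Context) error {
	fmt.Println("shutting down", r.name)
	return nil
}

// runner is illustrative only. Each cleanup handler is assigned as soon as
// its resource exists, before later setup steps get a chance to fail.
type runner struct {
	shutdownAgentHandler   func(context.Context) error
	shutdownServiceHandler func(context.Context) error
}

func (r *runner) setupScenario(ctx context.Context) error {
	agent := &resource{name: "agent"}
	r.shutdownAgentHandler = agent.Shutdown // set before the service is created

	svc := &resource{name: "service"}
	// If service setup failed here, the agent handler would already be in
	// place, so teardown could still shut the agent down.
	r.shutdownServiceHandler = svc.Shutdown
	return nil
}

func (r *runner) tearDown(ctx context.Context) {
	// Shut down the service first, then the agent that was running it.
	for _, h := range []func(context.Context) error{r.shutdownServiceHandler, r.shutdownAgentHandler} {
		if h != nil {
			h(ctx)
		}
	}
}

func main() {
	ctx := context.Background()
	r := &runner{}
	if err := r.setupScenario(ctx); err != nil {
		fmt.Println("setup failed:", err)
	}
	r.tearDown(ctx)
}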

mrodm (Contributor Author):

This change cannot be done. In the servicedeployer package, some agents are created there (custom agents and K8s agents), and therefore checkEnrolledAgents must be called after setupService.

At least as long as support is kept for those scenarios using the Elastic Agent from the stack.

@@ -149,7 +149,7 @@ func (tsd TerraformServiceDeployer) SetUp(ctx context.Context, svcInfo ServiceIn
}
err = p.Up(ctx, opts)
if err != nil {
return nil, fmt.Errorf("could not boot up service using Docker Compose: %w", err)
return &service, fmt.Errorf("could not boot up service using Docker Compose: %w", err)
mrodm (Contributor Author):

Maybe these changes could be done in a different PR.

I was thinking it would be interesting to also return the service here in case of error, to be able to run the shutdownServiceHandler.

The same applies if there is any error while waiting for the container to become healthy.

WDYT?

mrodm (Contributor Author):

Removed these changes for now.

Comment on lines +376 to +399
logger.Debugf("Container %s status: %s (no health status)", containerDescription.ID, containerDescription.State.Status)
continue
}

// Service is up and running and it's healthy
if containerDescription.State.Status == "running" && containerDescription.State.Health.Status == "healthy" {
logger.Debugf("Container %s status: %s (health: %s)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.Health.Status)
continue
}

// Container started and finished with exit code 0
if containerDescription.State.Status == "exited" && containerDescription.State.ExitCode == 0 {
logger.Debugf("Container %s status: %s (exit code: %d)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.ExitCode)
continue
}

// Container exited with code > 0
if containerDescription.State.Status == "exited" && containerDescription.State.ExitCode > 0 {
logger.Debugf("Container %s status: %s (exit code: %d)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.ExitCode)
return fmt.Errorf("container (ID: %s) exited with code %d", containerDescription.ID, containerDescription.State.ExitCode)
}

// Any different status is considered unhealthy
logger.Debugf("Container %s status: unhealthy", containerDescription.ID)
mrodm (Contributor Author):

Examples of logs with these changes:

2024/06/13 19:32:35 DEBUG Wait for healthy containers: 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:35 DEBUG output command: /usr/bin/docker inspect 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:35 DEBUG Container 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c status: unhealthy
2024/06/13 19:32:36 DEBUG Wait for healthy containers: 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:36 DEBUG output command: /usr/bin/docker inspect 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:36 DEBUG Container 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c status: running (health: healthy)

Previously:

2024/06/13 15:43:06 DEBUG Wait for healthy containers: 2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4
2024/06/13 15:43:06 DEBUG output command: /usr/bin/docker inspect 2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4
2024/06/13 15:43:06 DEBUG Container status: {"Config":{"Image":"elastic-package-service-87356-nginx","Labels":{"com.docker.compose.config-hash":"07b2d6403034e11414161c6489c787924e32ab36e7d12c69b977e184cf40e3c3","com.docker.compose.container-number":"1","com.docker.compose.depends_on":"","com.docker.compose.image":"sha256:106ba62762b92ccdde0edf49e09063ee28a3be98e9342dfcd3980314b0e4c192","com.docker.compose.oneoff":"False","com.docker.compose.project":"elastic-package-service-87356","com.docker.compose.project.config_files":"/opt/buildkite-agent/builds/bk-agent-prod-gcp-1718292959927056914/elastic/elastic-package/test/packages/parallel/nginx/_dev/deploy/docker/docker-compose.yml","com.docker.compose.project.working_dir":"/opt/buildkite-agent/builds/bk-agent-prod-gcp-1718292959927056914/elastic/elastic-package/test/packages/parallel/nginx/_dev/deploy/docker","com.docker.compose.service":"nginx","com.docker.compose.version":"2.24.1","maintainer":"NGINX Docker Maintainers \u003cdocker-maint@nginx.com\u003e"}},"ID":"2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4","State":{"Status":"running","ExitCode":0,"Health":{"Status":"healthy","Log":[{"Start":"2024-06-13T15:43:05.983549466Z","ExitCode":0,"Output":"  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100    97  100    97    0     0  97000      0 --:--:-- --:--:-- --:--:-- 97000\nActive connections: 1 \nserver accepts handled requests\n 1 1 1 \nReading: 0 Writing: 1 Waiting: 0 \n"}]}}}

jsoriano (Member):

Maybe these logs are too verbose? But we can leave them and update later when/if we add the trace debug level.

mrodm (Contributor Author):

Do you mean to keep the previous log messages?

jsoriano (Member):

I mean to leave it as is now in the PR, and reduce later when we have more log levels.

@@ -118,7 +118,8 @@ func (c *Client) waitUntilPolicyAssigned(ctx context.Context, a Agent, p Policy)
if err != nil {
return fmt.Errorf("can't get the agent: %w", err)
}
logger.Debugf("Agent data: %s", agent.String())
logger.Debugf("Agent %s (Host: %s): Policy ID %s LogLevel: %s Status: %s",
mrodm (Contributor Author) commented Jun 13, 2024:

These logs could probably be improved further (by removing some fields?).

Example of new logs:

2024/06/13 19:33:21 DEBUG Wait until the policy (ID: 1355408f-6518-4bbd-8b39-cded15bba61d, revision: 2) is assigned to the agent (ID: ca20edcb-a929-434d-b226-9138be510e80)...
2024/06/13 19:33:23 DEBUG GET https://127.0.0.1:5601/api/fleet/agents/ca20edcb-a929-434d-b226-9138be510e80
2024/06/13 19:33:23 DEBUG Agent ca20edcb-a929-434d-b226-9138be510e80 (Host: elastic-agent-12113): Policy ID 1355408f-6518-4bbd-8b39-cded15bba61d LogLevel: debug Status: updating

Previously:

2024/06/13 15:43:12 DEBUG Wait until the policy (ID: c9f48c54-c30e-4bb7-a17a-a66f6c1776b0, revision: 2) is assigned to the agent (ID: 8bf156ea-30b2-4714-8e99-d09059f0fe54)...
2024/06/13 15:43:14 DEBUG GET https://127.0.0.1:5601/api/fleet/agents/8bf156ea-30b2-4714-8e99-d09059f0fe54
2024/06/13 15:43:14 DEBUG Agent data: {"id":"8bf156ea-30b2-4714-8e99-d09059f0fe54","policy_id":"c9f48c54-c30e-4bb7-a17a-a66f6c1776b0","local_metadata":{"host":{"name":"docker-fleet-agent"},"elastic":{"agent":{"log_level":"info"}}},"status":"updating"}

Comment on lines 1220 to 1223

if err != nil {
	return nil, svcInfo, fmt.Errorf("could not setup service: %w", err)
}
mrodm (Contributor Author):

Should we move this error check, to allow the handler to be set?

@@ -1,6 +1,7 @@
 env:
   SETUP_GVM_VERSION: 'v0.5.2' # https://github.com/andrewkroh/gvm/issues/44#issuecomment-1013231151
   ELASTIC_PACKAGE_COMPOSE_DISABLE_VERBOSE_OUTPUT: "true"
+  ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS: 3
mrodm (Contributor Author):

If a package has parallel system tests enabled, those tests will be triggered using at most 3 goroutines.

go.mod Outdated
Comment on lines 187 to 188

replace github.com/elastic/package-spec/v3 => github.com/mrodm/package-spec/v3 v3.0.0-20240613145150-8d065e83d217
mrodm (Contributor Author):

Requires a new version of package-spec.

jsoriano (Member) left a comment:

Looks great, also tested locally and works as expected.
Commenting on some of the open questions.

)

type testConfig struct {
	testrunner.SkippableConfig `config:",inline"`
mrodm (Contributor Author):

Added the option to skip individual policy tests.
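
For example, an individual policy test could then be skipped with a test configuration like the following (assuming the same skip syntax used by other test types in elastic-package; reason and link are placeholders):

skip:
  reason: "<why this test is skipped>"
  link: "<URL of the tracking issue>"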

@mrodm mrodm requested a review from jsoriano June 18, 2024 15:52
jsoriano (Member) left a comment:

Great 👍


elasticmachine (Collaborator) commented Jun 19, 2024:

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @mrodm

@mrodm mrodm merged commit 700dd69 into elastic:main Jun 19, 2024
3 checks passed
@mrodm mrodm deleted the system-tests-parallel branch June 19, 2024 19:40