add automatic testing and aggregation to OSB #655
Conversation
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Force-pushed from 1d14510 to ab31ed0
Some considerations for running multiple tests
- When users run multiple tests, we should ensure that they automatically have --kill-running-processes by default. If it isn't present in the parsed args, we need to add it by default; otherwise we'll run into a wall of errors for every test after the first successful one, because existing processes might still be alive (see the sketch after this list).
- I tested what happens when users cancel (ctrl + c) the test: they have to cancel the test for each iteration (e.g. if a user has 100 test iterations, they'll be spamming ctrl + c for some time 😆 ). Alternatively, they can open a separate window and kill all running processes. We should find a way to make it easier to cancel all tests with a single invocation of ctrl + c if possible.
- When users run into errors across all tests, we should reduce the repeated help text and error output. If we can't avoid that, we should include an additional parameter called "--cancel-upon-error" or something like that.
- There are additional parameters that I think would be useful for running multiple tests (also covered in the sketch below):
  - --time-between-tests / --sleep-time (you can improve the name): this would allow users to specify how much time there should be between tests. It's a nice-to-have, as I've seen some users incorporate sleeps between OSB runs because they want to change something on the host's end between tests.
  - --disable-aggregate: by default we should have this set to False, but some users might not care about aggregating the final results. We should give them the option to disable it.
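A minimal sketch of what this could look like (the parser wiring, flag names, and the forced --kill-running-processes default are all illustrative assumptions, not the actual OSB implementation):

import argparse

# Hypothetical parser wiring; OSB's real parser setup lives in benchmark.py.
parser = argparse.ArgumentParser()
parser.add_argument("--test-iterations", type=int, default=1)
parser.add_argument("--kill-running-processes", action="store_true")
parser.add_argument("--time-between-tests", type=int, default=0,
                    help="Seconds to sleep between test iterations (name up for debate)")
parser.add_argument("--disable-aggregate", action="store_true",
                    help="Skip aggregating the results of the individual test executions")

args = parser.parse_args(["--test-iterations", "3"])
if args.test_iterations > 1 and not args.kill_running_processes:
    # Without this, leftover processes from iteration N tend to make iteration N+1 fail.
    args.kill_running_processes = True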
Some considerations for aggregating results after all tests
- Can we add the relative standard deviation (RSD) to the aggregate feature? This would be useful so that users can see the spread of their data and how much each test deviates from the others (a rough sketch follows this list).
- Maybe we could include the runs and their IDs in the user tags by default? That way users have a way to track down which test execution IDs relate to the aggregated one once the console is wiped.
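For reference, a minimal sketch of the RSD calculation (the function name and where it would live in aggregator.py are assumptions):

import statistics

def relative_standard_deviation(values):
    # RSD = (sample standard deviation / mean) * 100, expressed as a percentage.
    mean = statistics.mean(values)
    if mean == 0:
        return 0.0
    return (statistics.stdev(values) / mean) * 100

# e.g. mean throughput across three iterations:
print(relative_standard_deviation([35.2, 34.8, 36.1]))  # roughly 1.9 (%)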
osbenchmark/benchmark.py
Outdated
auto_aggregate_parser = subparsers.add_parser("auto-aggregate",
                                              add_help=False,
                                              help="Run multiple test executions with the same configuration and aggregate the results",
                                              parents=[test_execution_parser])
auto_aggregate_parser.add_argument(
    "--test-iterations",
    type=int,
    required=True,
    help="Number of test executions to run and aggregate")
Could auto-aggregate be renamed to execute-tests, and auto_aggregate_parser be renamed to multiple_test_executions_parser? Those names would be more suitable since the main intention behind this subcommand is to allow users to run multiple tests.
But before renaming them, please consider the following:
If all the parameters between test_execution_parser and auto_aggregate_parser are the same except --test-iterations, what are your thoughts on omitting a separate subcommand, incorporating --test-iterations into test_execution_parser, and having the multiple-tests logic performed in elif subcommand == "execute-test"?
There doesn't seem to be a large enough difference between running a single test and running multiple tests that justifies creating another subcommand. But let me know if there's another reason why a separate subcommand might be better.
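For illustration, a rough sketch of that alternative (the helper names run_single_test and run_multiple_tests_and_aggregate are placeholders, not real OSB functions):

import argparse

def run_single_test(args):
    ...  # placeholder for the existing execute-test path

def run_multiple_tests_and_aggregate(args):
    ...  # placeholder for the multi-iteration + aggregation logic

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest="subcommand")
test_execution_parser = subparsers.add_parser("execute-test")
test_execution_parser.add_argument(
    "--test-iterations", type=int, default=1,
    help="Number of test executions to run (and optionally aggregate)")

args = parser.parse_args(["execute-test", "--test-iterations", "3"])
if args.subcommand == "execute-test":
    if args.test_iterations > 1:
        run_multiple_tests_and_aggregate(args)
    else:
        run_single_test(args)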
Discussed offline and added in the revision 👍
osbenchmark/benchmark.py
Outdated
with semaphore:
    dispatch_sub_command(arg_parser, args, cfg)
Is semaphore really needed? Wondering if there might be some conflicts with preexisting threading / Thespianpy within OSB. Might be overkill. I think a simple for-loop should suffice?
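Roughly what I have in mind (execute_test here is just a stand-in for whatever OSB invokes per iteration, not the real entry point):

import time

def execute_test(cfg):
    ...  # placeholder for a single OSB test execution

def run_iterations(cfg, iterations, sleep_seconds=0):
    # Sequential runs, no locking; each iteration finishes before the next starts.
    for i in range(iterations):
        execute_test(cfg)
        if sleep_seconds and i < iterations - 1:
            time.sleep(sleep_seconds)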
We should avoid using dispatch_sub_command and manipulating the args in line 880.
Semaphore might be impacting the performance of each run. I ran a command that sets clients and target throughput to 10; this should ensure that the final results have a throughput of 10 ops/s. When I ran the command with auto-aggregate, I see that the target throughput is not being achieved across the individual runs or in the aggregated results.
Command: $ opensearch-benchmark auto-aggregate --test-iterations=3 --workload=geonames --target-hosts=XXXXXXX --client-options=basic_auth_user:'XXXXX',basic_auth_password:'XXXXXXX' --include-tasks="term" --test-mode --kill-running-processes --workload-params="search_clients:10,target_throughput:10"
# Each test has the following results
| Segment count | | 374 | |
| Min Throughput | term | 33.13 | ops/s |
| Mean Throughput | term | 35.22 | ops/s |
| Median Throughput | term | 35.69 | ops/s |
| Max Throughput | term | 36.39 | ops/s |
However, when I run the command with execute-test, it reaches the target throughput.
Command: $ opensearch-benchmark execute-test --workload=geonames --target-hosts=XXXXXXX --client-options=basic_auth_user:'XXXXXXX',basic_auth_password:'XXXXXXX' --include-tasks="term" --kill-running-processes --workload-params="search_clients:10,target_throughput:10"
| Segment count | | 374 | |
| Min Throughput | term | 9.99 | ops/s |
| Mean Throughput | term | 10 | ops/s |
| Median Throughput | term | 10 | ops/s |
| Max Throughput | term | 10 | ops/s |
Both commands were run on the same machine. We should reconsider using semaphore and go for a simpler approach. We can discuss this offline.
Is semaphore really needed? Wondering if there might be some conflicts with preexisting threading / Thespianpy within OSB. Might be overkill. I think a simple for-loop should suffice?
I tried a simple for-loop, but I was having issues with already running processes 😕 Let's discuss more offline like you mentioned.
Discussed more offline and used an alternate approach. I tried a few test runs on the same machine with and without the changes, and the results were very similar. Let me know if you find any differences
@OVI3D0 Overall, great work. This feature has been long overdue in OSB. Let's sync offline if you have any questions.
Force-pushed from 8df45a2 to 8dd001a
This is fixed!
Good idea! Added this.
Added this as well!
I instead added
I think this would be a great idea for a future PR
Yeah, great idea. I added individual test IDs to the user tags of each aggregated test result.
Force-pushed from 8dd001a to c4012dd
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Force-pushed from c4012dd to fab4166
else:
    console.info("Please enter a valid number of test iterations")
Will we ever reach this point, since argparse by default sets the value to 1 if --test-iterations is not provided? If not, let's remove the else statement.
True, I removed this. Although users can potentially enter invalid values.
Oops, you're right. Let's add it back
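Alternatively, a custom argparse type could reject invalid values at parse time; this is just a sketch, not necessarily how OSB should wire it up:

import argparse

def positive_int(value):
    # Rejects 0 and negative values before the subcommand logic ever runs.
    ivalue = int(value)
    if ivalue <= 0:
        raise argparse.ArgumentTypeError(f"{value} is not a valid number of test iterations")
    return ivalue

parser = argparse.ArgumentParser()
parser.add_argument("--test-iterations", type=positive_int, default=1)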
            break

    if args.aggregate:
        args.test_executions = test_exes
If this exception block gets reached, test_exes will still include the failed test_execution_id, since it was appended in line 990, and we'd be including the failed test execution id in the aggregation, which might result in a failure when OSB attempts to aggregate the successful results.
Maybe move line 990 to be after line 992, assuming execute_test succeeds, or we can pop from the list in the exception block (rough sketch below).
On a side note, based on the way the execute_test() function is constructed, I don't think we reach this exception block and aggregate the successful tests in cases where the user cancels the test. We'll need to look into this route but can keep it for now.
Nice catch, I moved line 990 so it only adds the test ID after execute_test is called.
osbenchmark/aggregator.py
Outdated
@@ -169,6 +169,7 @@ def build_aggregated_results(self):
        test_procedure = loaded_workload.find_test_procedure_or_default(test_exe.test_procedure)

        test_execution = metrics.create_test_execution(self.config, loaded_workload, test_procedure, test_exe.workload_revision)
        test_execution.user_tags = list(self.test_executions.keys())
Alternatively, we could assign the user_tag to a key/value pair, something like:
test_execution.user_tags = {
"aggregation-of-runs": list(self.test_executions.keys())
}
This would help users who are trying to filter for specific aggregated results in the MDS.
I didn't consider this, I fixed it to look like your suggestion above!
osbenchmark/benchmark.py
Outdated
"--cancel-on-error", | ||
action="store_true", | ||
help="Stop executing tests if an error occurs in one of the test iterations (default: false).", | ||
default=False |
No need to explicitly set default=False when using store_true or store_false actions in argparse. The default value is automatically set to False for store_true and True for store_false.
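For reference, a quick check of the default behavior:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cancel-on-error", action="store_true")

print(parser.parse_args([]).cancel_on_error)                     # False
print(parser.parse_args(["--cancel-on-error"]).cancel_on_error)  # True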
Haha oops. Fixed now :)
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
LGTM
Description
This change enhances OpenSearch Benchmark by allowing users to run multiple iterations of benchmark tests and automatically aggregate the results, all within a single command. New flags have been added to the existing execute command to support this functionality. Users can now specify the number of test iterations with --test-iterations, control result aggregation with --aggregate, set a sleep timer between iterations with --sleep-timer, and choose to cancel on error with --cancel-on-error. For example, users can run:
opensearch-benchmark execute --test-iterations=2 --aggregate=true --workload=geonames --target-hosts=127.0.0.1:9200
This command will run the geonames workload twice and aggregate the results. All existing arguments and flags for test execution remain compatible with these new options, allowing users to customize their benchmark runs as needed. For instance:
opensearch-benchmark execute --test-iterations=2 --aggregate=true --workload=geonames --target-hosts=127.0.0.1:9200 --test-mode --kill-running-processes
This enhancement provides a more streamlined way to conduct multiple test runs and obtain aggregated results, improving the benchmarking workflow in OpenSearch Benchmark.
Issues Resolved
#631
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.