add automatic testing and aggregation to OSB #655

Merged (4 commits) on Oct 4, 2024

Conversation

@OVI3D0 (Member) commented Sep 24, 2024

Description

This change enhances OpenSearch Benchmark by allowing users to run multiple iterations of benchmark tests and automatically aggregate the results, all within a single command. New flags have been added to the existing execute command to support this functionality. Users can now specify the number of test iterations with --test-iterations, control result aggregation with --aggregate, set a sleep timer between iterations with --sleep-timer, and choose to cancel on error with --cancel-on-error. For example, users can run:

opensearch-benchmark execute --test-iterations=2 --aggregate=true --workload=geonames --target-hosts=127.0.0.1:9200

This command will run the geonames workload twice and aggregate the results. All existing arguments and flags for test execution remain compatible with these new options, allowing users to customize their benchmark runs as needed. For instance:

opensearch-benchmark execute --test-iterations=2 --aggregate=true --workload=geonames --target-hosts=127.0.0.1:9200 --test-mode --kill-running-processes

This enhancement provides a more streamlined way to conduct multiple test runs and obtain aggregated results, improving the benchmarking workflow in OpenSearch Benchmark.

Issues Resolved

#631

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
@OVI3D0 force-pushed the add-auto-aggregation branch from 1d14510 to ab31ed0 on September 24, 2024 18:39
@OVI3D0 marked this pull request as ready for review on September 24, 2024 19:11
@IanHoang self-assigned this Sep 25, 2024
@IanHoang (Collaborator) left a comment

Some considerations for running multiple tests

  • When users run multiple tests, we should ensure that --kill-running-processes is applied automatically by default. If it isn't present in the parsed args, we should add it; otherwise every test after the first successful one will hit a wall of errors because processes from the previous run may still be alive. (See the sketch after this list.)
  • I tested what happens when users cancel a test with Ctrl+C: they have to cancel each iteration separately (e.g., with 100 test iterations they'd be spamming Ctrl+C for some time 😆). Alternatively, they can open a separate window and kill all running processes. We should find a way to cancel all tests with a single Ctrl+C if possible.
  • When users run into errors across all tests, we should reduce the repeated help text and error output. If we can't avoid that, we should add a parameter called "--cancel-upon-error" or something like that.
  • There are additional parameters that I think would be useful for running multiple tests:
    • --time-between-tests / --sleep-time (you can improve the name): this would let users specify how much time to wait between tests. It's a nice-to-have; I've seen users add sleeps between OSB runs because they want to change something on the host between tests.
    • --disable-aggregate: by default this should be set to False, but some users might not care about aggregating the final results, so we should give them the option to disable it.

Some considerations for aggregating results after all tests

  • Can we add the relative standard deviation (RSD) to the aggregate feature? This would be useful so users can see the spread of their data and how much each test deviates from the others. (See the sketch after this list.)
  • Maybe we should include the runs and their IDs in the user tags by default? That way users have a way to track down which test execution IDs relate to the aggregated one once the console is wiped.

Comment on lines 638 to 646
auto_aggregate_parser = subparsers.add_parser("auto-aggregate",
                                              add_help=False,
                                              help="Run multiple test executions with the same configuration and aggregate the results",
                                              parents=[test_execution_parser])
auto_aggregate_parser.add_argument(
    "--test-iterations",
    type=int,
    required=True,
    help="Number of test executions to run and aggregate")
Collaborator:

Could auto-aggregate be renamed to execute-tests and auto_aggregate_parser be renamed to multiple_test_executions_parser? Those names would be more suitable since the main intention behind this subcommand is to allow users to run multiple tests.

But before renaming them, please consider the following: if all the parameters between test_execution_parser and auto_aggregate_parser are the same except --test-iterations, what are your thoughts on omitting the separate subcommand, incorporating --test-iterations into test_execution_parser, and performing the multiple-tests logic in the elif subcommand == "execute-test" branch?

There doesn't seem to be a large enough difference between running a single test and running multiple tests that justifies creating another subcommand. But let me know if there's another reason why a separate subcommand might be better.

@OVI3D0 (Member Author):

Discussed offline and added in the revision 👍

Comment on lines 887 to 888
with semaphore:
    dispatch_sub_command(arg_parser, args, cfg)
Collaborator:

Is semaphore really needed? Wondering if there might be some conflicts with preexisting threading / Thespianpy within OSB. Might be overkill. I think a simple for-loop should suffice?

Collaborator:

We should avoid using dispatch_sub_command and manipulating the args in line 880.

Collaborator:

Semaphore might be impacting the performance of each run. I ran a command that sets clients and target throughput to 10, which should ensure the final results show a throughput of 10 ops/s. When I ran the command with auto-aggregate, the target throughput was not achieved across the runs or in the aggregated results.

Command: $ opensearch-benchmark auto-aggregate --test-iterations=3 --workload=geonames --target-hosts=XXXXXXX --client-options=basic_auth_user:'XXXXX',basic_auth_password:'XXXXXXX' --include-tasks="term" --test-mode --kill-running-processes --workload-params="search_clients:10,target_throughput:10"

# Each test has the following results
|                                                  Segment count |        |         374 |        |
|                                                 Min Throughput |   term |       33.13 |  ops/s |
|                                                Mean Throughput |   term |       35.22 |  ops/s |
|                                              Median Throughput |   term |       35.69 |  ops/s |
|                                                 Max Throughput |   term |       36.39 |  ops/s |

However, when I run the command with execute-test, it reaches the target throughput.

Command: $ opensearch-benchmark execute-test --workload=geonames --target-hosts=XXXXXXX --client-options=basic_auth_user:'XXXXXXX',basic_auth_password:'XXXXXXX' --include-tasks="term" --kill-running-processes --workload-params="search_clients:10,target_throughput:10"

|                                                  Segment count |        |         374 |        |
|                                                 Min Throughput |   term |        9.99 |  ops/s |
|                                                Mean Throughput |   term |          10 |  ops/s |
|                                              Median Throughput |   term |          10 |  ops/s |
|                                                 Max Throughput |   term |          10 |  ops/s |

Both commands were run on the same machine. We should reconsider using semaphore and go for a simpler approach. We can discuss this offline.

@OVI3D0 (Member Author) commented Sep 26, 2024:

> Is semaphore really needed? Wondering if there might be some conflicts with preexisting threading / Thespianpy within OSB. Might be overkill. I think a simple for-loop should suffice?

I tried a simple for loop but I was having issues with already-running processes 😕 Let's discuss more offline like you mentioned.

@OVI3D0 (Member Author):

Discussed more offline and used an alternate approach. I tried a few test runs on the same machine with and without the changes, and the results were very similar. Let me know if you find any differences.

osbenchmark/benchmark.py (outdated comment, resolved)
@IanHoang (Collaborator):
@OVI3D0 Overall, great work. This feature has been long overdue in OSB. Let's sync offline if you have any questions.

@OVI3D0 requested a review from IanHoang on September 27, 2024 19:09
@OVI3D0 force-pushed the add-auto-aggregation branch from 8df45a2 to 8dd001a on September 27, 2024 19:14
@OVI3D0 (Member Author) commented Sep 27, 2024

  • I tested what happens when users cancel a test with Ctrl+C: they have to cancel each iteration separately (e.g., with 100 test iterations they'd be spamming Ctrl+C for some time 😆). Alternatively, they can open a separate window and kill all running processes. We should find a way to cancel all tests with a single Ctrl+C if possible.

This is fixed! (One possible approach is sketched below.)

  • When users run into errors across all tests, we should reduce the repeated help text and error output. If we can't avoid that, we should add a parameter called "--cancel-upon-error" or something like that.

Good idea! Added this.

  • --time-between-tests / --sleep-time (you can improve the name): this would let users specify how much time to wait between tests. It's a nice-to-have; I've seen users add sleeps between OSB runs because they want to change something on the host between tests.

Added this as well!

  • --disable-aggregate: by default this should be set to False, but some users might not care about aggregating the final results, so we should give them the option to disable it.

I instead added --aggregate and defaulted it to True. But if you think it makes more sense the other way around, I can fix this.

Some considerations for aggregating results after all tests

  • Can we add the relative standard deviation (RSD) to the aggregate feature? This would be useful so users can see the spread of their data and how much each test deviates from the others.

I think this would be a great idea for a future PR

  • Maybe we should include the runs and their IDs in the user tags by default? That way users have a way to track down which test execution IDs relate to the aggregated one once the console is wiped.

Yeah, great idea. I added individual test IDs to the user tags of each aggregated test result.

@OVI3D0 force-pushed the add-auto-aggregation branch from 8dd001a to c4012dd on September 27, 2024 19:20
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Comment on lines +1009 to +1010
else:
    console.info("Please enter a valid number of test iterations")
Collaborator:

Will we ever reach this point, since argparse sets the value to 1 by default if --test-iterations is not provided? If not, let's remove the else statement.

@OVI3D0 (Member Author):

True, I removed this. Although users can potentially enter invalid values.

Collaborator:

Oops, you're right. Let's add it back.

break

if args.aggregate:
    args.test_executions = test_exes
Collaborator:

If this exception block gets reached, test_exes will still include the failed test_execution_id, since it was appended on line 990, and we'd be including the failed test execution ID in the aggregation, which might cause a failure when OSB attempts to aggregate the successful results.

Maybe move line 990 to after line 992, so it runs only if execute_test succeeds, or pop the ID from the list in the exception block.

On a side note, based on the way the execute_test() function is constructed, I don't think we reach this exception block and aggregate the successful tests when the user cancels a test. We'll need to look into this path, but we can keep it for now.

@OVI3D0 (Member Author):

Nice catch, I moved line 990 so it only adds the test ID after execute_test is called.

@@ -169,6 +169,7 @@ def build_aggregated_results(self):
     test_procedure = loaded_workload.find_test_procedure_or_default(test_exe.test_procedure)

     test_execution = metrics.create_test_execution(self.config, loaded_workload, test_procedure, test_exe.workload_revision)
+    test_execution.user_tags = list(self.test_executions.keys())
Collaborator:

Alternatively, we could assign the user_tag to a k,v pair, something like:

 test_execution.user_tags = {
    "aggregation-of-runs": list(self.test_executions.keys())
}

Collaborator:

This would help users who are trying to filter for specific aggregated results in the MDS.

@OVI3D0 (Member Author):

I didn't consider this; I fixed it to look like your suggestion above!

"--cancel-on-error",
action="store_true",
help="Stop executing tests if an error occurs in one of the test iterations (default: false).",
default=False
Collaborator:

No need to explicitly set default=False when using store_true or store_false actions in argparse. The default value is automatically set to False for store_true and True for store_false.

@OVI3D0 (Member Author):

Haha oops. Fixed now :)

Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
@OVI3D0 requested a review from IanHoang on October 2, 2024 21:46
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
@IanHoang (Collaborator) left a comment

LGTM

@IanHoang merged commit 83040a9 into opensearch-project:main on Oct 4, 2024
10 checks passed
@OVI3D0 deleted the add-auto-aggregation branch on October 28, 2024 16:43