add automatic testing and aggregation to OSB #655
Conversation
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Force-pushed from 1d14510 to ab31ed0
Some considerations for running multiple tests
- When users run multiple tests, we should ensure that they automatically have --kill-running-processes by default. If it isn't present in the parsed args, we need to add it by default; otherwise we'll run into a wall of errors for every test after the first successful one, because existing processes might still be alive (see the sketch after this list).
- I tested what happens when users cancel (ctrl + c) the test: they have to cancel the test for each iteration (e.g. if a user has 100 test iterations, they'll be spamming ctrl + c for some time 😆 ). Alternatively, they can open a separate window and kill all running processes. We should find a way to make it easier to cancel all tests with a single invocation of ctrl + c if possible.
- When users run into errors across all tests, we should reduce the repeated help text and error output. If we can't avoid that, we should include an additional parameter called "--cancel-upon-error" or something like that.
- There are additional parameters that I think would be useful for running multiple tests (also covered in the sketch below):
  - --time-between-tests / --sleep-time (you can improve the name): this would allow users to specify how much time there should be between tests. It's a nice-to-have, as I've seen some users incorporate sleeps between OSB runs because they want to change something on the host's end between tests.
  - --disable-aggregate: by default we should have this set to False, but some users might not care about aggregating the final results. We should give them the option to disable it.
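A minimal sketch of what this could look like (the parser wiring, flag names, and the forced --kill-running-processes default are all illustrative assumptions, not the actual OSB implementation):

import argparse

# Hypothetical parser wiring; OSB's real parser setup lives in benchmark.py.
parser = argparse.ArgumentParser()
parser.add_argument("--test-iterations", type=int, default=1)
parser.add_argument("--kill-running-processes", action="store_true")
parser.add_argument("--time-between-tests", type=int, default=0,
                    help="Seconds to sleep between test iterations (name up for debate)")
parser.add_argument("--disable-aggregate", action="store_true",
                    help="Skip aggregating the results of the individual test executions")

args = parser.parse_args(["--test-iterations", "3"])
if args.test_iterations > 1 and not args.kill_running_processes:
    # Without this, leftover processes from iteration N tend to make iteration N+1 fail.
    args.kill_running_processes = True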
Some considerations for aggregating results after all tests
- Can we add the relative standard deviation (RSD) to the aggregate feature? This would be useful so that users can see the spread of their data and how much each test deviates from the others (a rough sketch follows this list).
- Maybe we could include the runs and their IDs in the user tags by default? That way users have a way to track down which test execution IDs relate to the aggregated one once the console is wiped.
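For reference, a minimal sketch of the RSD calculation (the function name and where it would live in aggregator.py are assumptions):

import statistics

def relative_standard_deviation(values):
    # RSD = (sample standard deviation / mean) * 100, expressed as a percentage.
    mean = statistics.mean(values)
    if mean == 0:
        return 0.0
    return (statistics.stdev(values) / mean) * 100

# e.g. mean throughput across three iterations:
print(relative_standard_deviation([35.2, 34.8, 36.1]))  # roughly 1.9 (%)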
osbenchmark/benchmark.py
Outdated
auto_aggregate_parser = subparsers.add_parser("auto-aggregate",
                                              add_help=False,
                                              help="Run multiple test executions with the same configuration and aggregate the results",
                                              parents=[test_execution_parser])
auto_aggregate_parser.add_argument(
    "--test-iterations",
    type=int,
    required=True,
    help="Number of test executions to run and aggregate")
Could auto-aggregate be renamed to execute-tests, and auto_aggregate_parser be renamed to multiple_test_executions_parser? Those names would be more suitable since the main intention behind this subcommand is to allow users to run multiple tests.
But before renaming them, please consider the following:
If all the parameters between test_execution_parser and auto_aggregate_parser are the same except --test-iterations, what are your thoughts on omitting a separate subcommand, incorporating --test-iterations into test_execution_parser, and having the multiple-tests logic performed in elif subcommand == "execute-test"?
There doesn't seem to be a large enough difference between running a single test and running multiple tests that justifies creating another subcommand. But let me know if there's another reason why a separate subcommand might be better.
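For illustration, a rough sketch of that alternative (the helper names run_single_test and run_multiple_tests_and_aggregate are placeholders, not real OSB functions):

import argparse

def run_single_test(args):
    ...  # placeholder for the existing execute-test path

def run_multiple_tests_and_aggregate(args):
    ...  # placeholder for the multi-iteration + aggregation logic

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest="subcommand")
test_execution_parser = subparsers.add_parser("execute-test")
test_execution_parser.add_argument(
    "--test-iterations", type=int, default=1,
    help="Number of test executions to run (and optionally aggregate)")

args = parser.parse_args(["execute-test", "--test-iterations", "3"])
if args.subcommand == "execute-test":
    if args.test_iterations > 1:
        run_multiple_tests_and_aggregate(args)
    else:
        run_single_test(args)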
Discussed offline and added in the revision 👍
osbenchmark/benchmark.py
Outdated
with semaphore:
    dispatch_sub_command(arg_parser, args, cfg)
Is semaphore really needed? Wondering if there might be some conflicts with preexisting threading / Thespianpy within OSB. Might be overkill. I think a simple for-loop should suffice?
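Roughly what I have in mind (execute_test here is just a stand-in for whatever OSB invokes per iteration, not the real entry point):

import time

def execute_test(cfg):
    ...  # placeholder for a single OSB test execution

def run_iterations(cfg, iterations, sleep_seconds=0):
    # Sequential runs, no locking; each iteration finishes before the next starts.
    for i in range(iterations):
        execute_test(cfg)
        if sleep_seconds and i < iterations - 1:
            time.sleep(sleep_seconds)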
We should avoid using dispatch_sub_command and manipulating the args in line 880.
Semaphore might be impacting the performance of each run. I ran a command that sets clients and target throughput to 10; this should ensure that the final results have a throughput of 10 ops/s. When I ran the command with auto-aggregate, I see that the target throughput is not being achieved across the individual runs or in the aggregated results.
Command: $ opensearch-benchmark auto-aggregate --test-iterations=3 --workload=geonames --target-hosts=XXXXXXX --client-options=basic_auth_user:'XXXXX',basic_auth_password:'XXXXXXX' --include-tasks="term" --test-mode --kill-running-processes --workload-params="search_clients:10,target_throughput:10"
# Each test has the following results
| Segment count | | 374 | |
| Min Throughput | term | 33.13 | ops/s |
| Mean Throughput | term | 35.22 | ops/s |
| Median Throughput | term | 35.69 | ops/s |
| Max Throughput | term | 36.39 | ops/s |
However, when I run the command with execute-test, it reaches the target throughput.
Command: $ opensearch-benchmark execute-test --workload=geonames --target-hosts=XXXXXXX --client-options=basic_auth_user:'XXXXXXX',basic_auth_password:'XXXXXXX' --include-tasks="term" --kill-running-processes --workload-params="search_clients:10,target_throughput:10"
| Segment count | | 374 | |
| Min Throughput | term | 9.99 | ops/s |
| Mean Throughput | term | 10 | ops/s |
| Median Throughput | term | 10 | ops/s |
| Max Throughput | term | 10 | ops/s |
Both commands were run on the same machine. We should reconsider using semaphore and go for a simpler approach. We can discuss this offline.
Is semaphore really needed? Wondering if there might be some conflicts with preexisting threading / Thespianpy within OSB. Might be overkill. I think a simple for-loop should suffice?
I tried a simple for-loop, but I was having issues with already running processes 😕 Let's discuss more offline like you mentioned.
Discussed more offline and used an alternate approach. I tried a few test runs on the same machine with and without the changes, and the results were very similar. Let me know if you find any differences
@OVI3D0 Overall, great work. This feature has been long overdue in OSB. Let's sync offline if you have any questions.
Force-pushed from 8df45a2 to 8dd001a
This is fixed!
Good idea! Added this.
Added this as well!
I instead added
I think this would be a great idea for a future PR
Yeah, great idea. I added individual test IDs to the user tags of each aggregated test result.
Force-pushed from 8dd001a to c4012dd
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Force-pushed from c4012dd to fab4166
else:
    console.info("Please enter a valid number of test iterations")
Will we ever reach this point, since argparse by default sets the value to 1 if --test-iterations is not provided? If not, let's remove the else statement.
True, I removed this. Although users can potentially enter invalid values.
Oops, you're right. Let's add it back
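Alternatively, a custom argparse type could reject invalid values at parse time; this is just a sketch, not necessarily how OSB should wire it up:

import argparse

def positive_int(value):
    # Rejects 0 and negative values before the subcommand logic ever runs.
    ivalue = int(value)
    if ivalue <= 0:
        raise argparse.ArgumentTypeError(f"{value} is not a valid number of test iterations")
    return ivalue

parser = argparse.ArgumentParser()
parser.add_argument("--test-iterations", type=positive_int, default=1)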
            break

    if args.aggregate:
        args.test_executions = test_exes
If this exception block gets reached, test_exes will still include the failed test_execution_id, since it was appended in line 990, and we'd be including the failed test execution id in the aggregation, which might result in a failure when OSB attempts to aggregate the successful results.
Maybe move line 990 to be after line 992, assuming execute_test succeeds, or we can pop from the list in the exception block (rough sketch below).
On a side note, based on the way the execute_test() function is constructed, I don't think we reach this exception block and aggregate the successful tests in cases where the user cancels the test. We'll need to look into this route but can keep it for now.
Nice catch, I moved line 990 so it only adds the test ID after execute_test is called.
osbenchmark/aggregator.py
Outdated
@@ -169,6 +169,7 @@ def build_aggregated_results(self):
        test_procedure = loaded_workload.find_test_procedure_or_default(test_exe.test_procedure)

        test_execution = metrics.create_test_execution(self.config, loaded_workload, test_procedure, test_exe.workload_revision)
        test_execution.user_tags = list(self.test_executions.keys())
Alternatively, we could assign the user_tag to a key/value pair, something like:
test_execution.user_tags = {
"aggregation-of-runs": list(self.test_executions.keys())
}
This would help users who are trying to filter for specific aggregated results in the MDS.
I didn't consider this, I fixed it to look like your suggestion above!
osbenchmark/benchmark.py
Outdated
"--cancel-on-error", | ||
action="store_true", | ||
help="Stop executing tests if an error occurs in one of the test iterations (default: false).", | ||
default=False |
No need to explicitly set default=False when using store_true or store_false actions in argparse. The default value is automatically set to False for store_true and True for store_false.
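For reference, a quick check of the default behavior:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cancel-on-error", action="store_true")

print(parser.parse_args([]).cancel_on_error)                     # False
print(parser.parse_args(["--cancel-on-error"]).cancel_on_error)  # True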
Haha oops. Fixed now :)
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
LGTM
Description
This change enhances OpenSearch Benchmark by allowing users to run multiple iterations of benchmark tests and automatically aggregate the results, all within a single command. New flags have been added to the existing execute command to support this functionality. Users can now specify the number of test iterations with --test-iterations, control result aggregation with --aggregate, set a sleep timer between iterations with --sleep-timer, and choose to cancel on error with --cancel-on-error. For example, users can run:
opensearch-benchmark execute --test-iterations=2 --aggregate=true --workload=geonames --target-hosts=127.0.0.1:9200
This command will run the geonames workload twice and aggregate the results. All existing arguments and flags for test execution remain compatible with these new options, allowing users to customize their benchmark runs as needed. For instance:
opensearch-benchmark execute --test-iterations=2 --aggregate=true --workload=geonames --target-hosts=127.0.0.1:9200 --test-mode --kill-running-processes
This enhancement provides a more streamlined way to conduct multiple test runs and obtain aggregated results, improving the benchmarking workflow in OpenSearch Benchmark.
Issues Resolved
#631
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.