-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pkg/stanza] Make Stanza adapter more synchronous by removing channels and workers #35453
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Removing |
Hi! I would like to take this on if this is not taken yet |
It's yours @bacherfl, thanks for picking it up! |
Thanks @andrzej-stencel! One quick question to clarify: Currently the Should we remove the async channels from the |
Good point @bacherfl. I wasn't aware that the I put "almost" in the sentence above, because we want to keep using the 100-entry buffer in the LogEmitter, which makes it not completely synchronous after removing the channels and workers. |
Thanks for the clarification @andrzej-stencel - then I will proceed with also adapting the log transform processor and the container parser. I also saw that you were doing some performance evaluations in #35454, so I will also use that method to compare the changes I make and update the PR once I have the results |
alright, I think the PR is ready for a first round of reviews - I have added the performance test results to the PR description. CC @andrzej-stencel |
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This PR introduces a benchmark test for the receiver adapter which tests the performance of handling the log entries sent to the `LogEmitter`, which are then received and converted by the receiver adapter. The changes have been made based on the comment linked here: #35669 (comment) <!-- Issue number (e.g. #1234) or full URL to issue, if applicable. --> #### Link to tracking issue Part of #35453 <!--Describe what testing was performed and which tests were added.--> #### Testing Added additional benchmark tests --------- Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
…nous (#35669) <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This PR changes the `LogEmitter` to accept a synchronous consumer callback function for processing a batch of log entries as an alternative to sending log entry batches to a channel. The components that use the `LogEmitter` (adapter and parser) have been adapted accordingly. Also, in the case of the adapter, the log entries are converted directly, rather than sending them over a channel to the converter and receiving the converted results over a different channel. <!-- Issue number (e.g. #1234) or full URL to issue, if applicable. --> #### Link to tracking issue Fixes #35453 <!--Describe what testing was performed and which tests were added.--> #### Testing I did some initial performance tests using the `TestLogLargeFiles` load test to see how this change affects the performance. Below are the results: **Before the change (i.e. with async log entry batch processing)** ``` === RUN TestLogLargeFiles/filelog-largefiles-2Gb-lifetime 2024/10/08 09:02:53 | Sent:17,769,795 logs (179,507/sec) | Received:17,755,188 items (179,346/sec) === RUN TestLogLargeFiles/filelog-largefiles-6GB-lifetime 2024/10/08 09:06:29 | Sent:42,857,755 logs (216,465/sec) | Received:42,851,987 items (216,424/sec) Test |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items| ---------------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:| LogLargeFiles/filelog-largefiles-2Gb-lifetime|PASS | 100s| 73.1| 78.4| 106| 118| 18249451| 18249451| LogLargeFiles/filelog-largefiles-6GB-lifetime|PASS | 200s| 87.5| 98.1| 110| 116| 44358460| 44358460| ``` **After the change (i.e. with sync log entry batch processing)** ``` === RUN TestLogLargeFiles/filelog-largefiles-2Gb-lifetime 2024/10/08 10:09:51 Agent RAM (RES): 139 MiB, CPU:71.7% | Sent:17,802,561 logs (179,836/sec) | Received:17,788,273 items (179,680/sec) === RUN TestLogLargeFiles/filelog-largefiles-6GB-lifetime 2024/10/08 10:05:15 Agent RAM (RES): 140 MiB, CPU:95.6% | Sent:42,912,030 logs (216,744/sec) | Received:42,904,306 items (216,689/sec) Test |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items| ---------------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:| LogLargeFiles/filelog-largefiles-2Gb-lifetime|PASS | 100s| 74.8| 78.9| 127| 139| 17984687| 17984687| LogLargeFiles/filelog-largefiles-6GB-lifetime|PASS | 200s| 89.3| 100.9| 134| 140| 43376210| 43376210| ``` Those results seem to indicate comparable throughput, but with an increased resource consumption, especially in terms of memory. I also did a test comparing the performance between the synchronous and asynchronous log emitter using the same methodology as in #35454. The results were the following, and indicate an increase in the time it takes for reading the generated log file (see #35454 for details on how the file is generated and the test execution): - Async Log Emitter: ~8s - Sync Log Emitter: ~12s <details> <summary>output-async.log</summary> === Step 3: Thu Oct 10 08:54:23 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 2.209674e+06 === Step 4: Thu Oct 10 08:54:25 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 5.428103e+06 === Step 5: Thu Oct 10 08:54:26 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 7.337017e+06 === Step 6: Thu Oct 10 08:54:27 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 9.258843e+06 === Step 7: Thu Oct 10 08:54:29 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.3082428e+07 === Step 8: Thu Oct 10 08:54:31 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.6519068e+07 </details> <details> <summary>output-sync.log</summary> === Step 2: Thu Oct 10 08:51:27 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.580891e+06 === Step 3: Thu Oct 10 08:51:28 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 3.01034e+06 === Step 4: Thu Oct 10 08:51:29 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 4.434627e+06 === Step 5: Thu Oct 10 08:51:31 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 7.416612e+06 === Step 6: Thu Oct 10 08:51:34 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.0496072e+07 === Step 7: Thu Oct 10 08:51:36 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.3523882e+07 === Step 8: Thu Oct 10 08:51:37 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.4929707e+07 === Step 9: Thu Oct 10 08:51:39 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.6519105e+07 </details> --------- Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com> Co-authored-by: Andrzej Stencel <andrzej.stencel@elastic.co>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This PR removes the `Converter` type that was previously used mainly by the stanza receiver adapter (see #35669 (comment) for more details). Two other receivers were still using the converter to generate test data within the unit tests, so those have been adapted as well with this PR <!-- Issue number (e.g. #1234) or full URL to issue, if applicable. --> #### Link to tracking issue Follow up to #35453 <!--Describe what testing was performed and which tests were added.--> #### Testing Adapted unit tests that were still using the converter --------- Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This PR removes the `Converter` type that was previously used mainly by the stanza receiver adapter (see open-telemetry#35669 (comment) for more details). Two other receivers were still using the converter to generate test data within the unit tests, so those have been adapted as well with this PR <!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. --> #### Link to tracking issue Follow up to open-telemetry#35453 <!--Describe what testing was performed and which tests were added.--> #### Testing Adapted unit tests that were still using the converter --------- Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This PR introduces a benchmark test for the receiver adapter which tests the performance of handling the log entries sent to the `LogEmitter`, which are then received and converted by the receiver adapter. The changes have been made based on the comment linked here: open-telemetry#35669 (comment) <!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. --> #### Link to tracking issue Part of open-telemetry#35453 <!--Describe what testing was performed and which tests were added.--> #### Testing Added additional benchmark tests --------- Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
…nous (open-telemetry#35669) <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This PR changes the `LogEmitter` to accept a synchronous consumer callback function for processing a batch of log entries as an alternative to sending log entry batches to a channel. The components that use the `LogEmitter` (adapter and parser) have been adapted accordingly. Also, in the case of the adapter, the log entries are converted directly, rather than sending them over a channel to the converter and receiving the converted results over a different channel. <!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. --> #### Link to tracking issue Fixes open-telemetry#35453 <!--Describe what testing was performed and which tests were added.--> #### Testing I did some initial performance tests using the `TestLogLargeFiles` load test to see how this change affects the performance. Below are the results: **Before the change (i.e. with async log entry batch processing)** ``` === RUN TestLogLargeFiles/filelog-largefiles-2Gb-lifetime 2024/10/08 09:02:53 | Sent:17,769,795 logs (179,507/sec) | Received:17,755,188 items (179,346/sec) === RUN TestLogLargeFiles/filelog-largefiles-6GB-lifetime 2024/10/08 09:06:29 | Sent:42,857,755 logs (216,465/sec) | Received:42,851,987 items (216,424/sec) Test |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items| ---------------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:| LogLargeFiles/filelog-largefiles-2Gb-lifetime|PASS | 100s| 73.1| 78.4| 106| 118| 18249451| 18249451| LogLargeFiles/filelog-largefiles-6GB-lifetime|PASS | 200s| 87.5| 98.1| 110| 116| 44358460| 44358460| ``` **After the change (i.e. with sync log entry batch processing)** ``` === RUN TestLogLargeFiles/filelog-largefiles-2Gb-lifetime 2024/10/08 10:09:51 Agent RAM (RES): 139 MiB, CPU:71.7% | Sent:17,802,561 logs (179,836/sec) | Received:17,788,273 items (179,680/sec) === RUN TestLogLargeFiles/filelog-largefiles-6GB-lifetime 2024/10/08 10:05:15 Agent RAM (RES): 140 MiB, CPU:95.6% | Sent:42,912,030 logs (216,744/sec) | Received:42,904,306 items (216,689/sec) Test |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items| ---------------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:| LogLargeFiles/filelog-largefiles-2Gb-lifetime|PASS | 100s| 74.8| 78.9| 127| 139| 17984687| 17984687| LogLargeFiles/filelog-largefiles-6GB-lifetime|PASS | 200s| 89.3| 100.9| 134| 140| 43376210| 43376210| ``` Those results seem to indicate comparable throughput, but with an increased resource consumption, especially in terms of memory. I also did a test comparing the performance between the synchronous and asynchronous log emitter using the same methodology as in open-telemetry#35454. The results were the following, and indicate an increase in the time it takes for reading the generated log file (see open-telemetry#35454 for details on how the file is generated and the test execution): - Async Log Emitter: ~8s - Sync Log Emitter: ~12s <details> <summary>output-async.log</summary> === Step 3: Thu Oct 10 08:54:23 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 2.209674e+06 === Step 4: Thu Oct 10 08:54:25 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 5.428103e+06 === Step 5: Thu Oct 10 08:54:26 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 7.337017e+06 === Step 6: Thu Oct 10 08:54:27 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 9.258843e+06 === Step 7: Thu Oct 10 08:54:29 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.3082428e+07 === Step 8: Thu Oct 10 08:54:31 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.6519068e+07 </details> <details> <summary>output-sync.log</summary> === Step 2: Thu Oct 10 08:51:27 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.580891e+06 === Step 3: Thu Oct 10 08:51:28 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 3.01034e+06 === Step 4: Thu Oct 10 08:51:29 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 4.434627e+06 === Step 5: Thu Oct 10 08:51:31 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 7.416612e+06 === Step 6: Thu Oct 10 08:51:34 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.0496072e+07 === Step 7: Thu Oct 10 08:51:36 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.3523882e+07 === Step 8: Thu Oct 10 08:51:37 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.4929707e+07 === Step 9: Thu Oct 10 08:51:39 CEST 2024 otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.6519105e+07 </details> --------- Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com> Co-authored-by: Andrzej Stencel <andrzej.stencel@elastic.co>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description This PR removes the `Converter` type that was previously used mainly by the stanza receiver adapter (see open-telemetry#35669 (comment) for more details). Two other receivers were still using the converter to generate test data within the unit tests, so those have been adapted as well with this PR <!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. --> #### Link to tracking issue Follow up to open-telemetry#35453 <!--Describe what testing was performed and which tests were added.--> #### Testing Adapted unit tests that were still using the converter --------- Signed-off-by: Florian Bacher <florian.bacher@dynatrace.com>
Component(s)
pkg/stanza
Is your feature request related to a problem? Please describe.
This issue is created as a result of discussion in #31074. From #31074 (comment):
and:
Describe the solution you'd like
Remove the channels and workers by having
LogEmitter
convert Stanza entries to log records and callConsumeLogs
synchronously. Leave the 100-log buffering in place for this change, to minimize the impact. Compare benchmarks.Describe alternatives you've considered
Leaving the 100-log buffering in place leaves the issue of losing logs still not completely resolved. Ideally we want to get rid of this buffering, but we need to step cautiously, as this may have serious performance impact. We should measure this impact and possibly explore possibilities for Stanza receivers like Filelog receiver / File consumer to emit entries in batches, not one by one. Implementing this would alleviate the possible performance impact that removing of batching in
LogEmitter
may introduce.Additional context
See #31074
The text was updated successfully, but these errors were encountered: