Fixed race condition in the OCI Metrics integration test between retrieval of metrics from registry and asserting that from expected results #4897

klustria
Member

To fix the issue, the following changes were made:

  1. Used CountDownLatches to signal when to start testing, for example, to run the assertions only after the results have been retrieved.
  2. Gave OciMetricsCdiExtension a higher Priority value than MetricsCdiExtension so that it starts only after MetricsCdiExtension has completed.

@klustria klustria requested a review from tjquinno September 15, 2022 16:37
@klustria klustria self-assigned this Sep 15, 2022
…ate and fail test if InterruptedException is received in delay()
@klustria klustria requested a review from tjquinno September 16, 2022 18:20
@tjquinno
Member

I think there is still a possible race condition (although probably unlikely).

What if the mocking code in doAnswer invokes countDown on the first latch but then, before the main test code resumes, the mocking code runs again? The counter will not have been incremented between the two runs of the mocking code. When the main test thread does resume, the scheduled mocking code will have stored the value 1 into both slots in the array, and the second assertion will fail. This is probably very unlikely given the delay between runs of the mocking code, but it’s conceivable.

The real goal of the test is to make sure that the value stored in the counter is used correctly in preparing the OCI metrics data, specifically that the mocking code captures 1 and then 2 on successive executions. It does not matter which thread updates the counter's value as long as the counter is incremented after the mocking code captures the value the first time but before it does so the second time.

We can accomplish that goal by:

  1. moving the second counter.inc() from the main test thread into the then part of the if (noOfExecutions == 1) in the mocking code. This guarantees that the second counter.inc() invocation occurs after the mocking code has captured the metric data the first time but before it captures the data the second time. (ScheduledExecutorService will not allow the next periodic run to begin until after the preceding one completes.)
  2. moving the countDownLatch1.countDown() invocation from the then part to the else part of that same if. The latch then means that both data points have been captured and placed into the results array.
  3. removing the second countdown latch.

After the main test thread starts the web server, it waits on the countdown latch as before. Once the latch releases that thread (which now happens only after the mocking code has run twice and fully loaded the array of results), the main test thread simply performs the two assertions on the two array elements, which are now both set.
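
The restructuring described in this comment can be sketched outside the real test roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the actual Helidon test code: the class name SingleLatchSketch, the method captureTwice, the AtomicLong standing in for the metrics counter, and the 10 ms period are all hypothetical. Only the noOfExecutions check, the placement of the second increment, and the single latch follow the description.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class SingleLatchSketch {

    /** Runs a periodic stand-in for the mock twice and returns the two captured counter values. */
    static long[] captureTwice() {
        AtomicLong counter = new AtomicLong();         // hypothetical stand-in for the metrics counter
        long[] results = new long[2];                  // slots the test thread asserts on
        int[] noOfExecutions = {0};                    // only touched by the scheduled task
        CountDownLatch latch = new CountDownLatch(1);  // the single remaining latch

        counter.incrementAndGet();                     // first inc(), done by the test thread

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            scheduler.scheduleAtFixedRate(() -> {
                int run = ++noOfExecutions[0];
                if (run > 2) {
                    return;                            // both slots are filled; ignore later runs
                }
                results[run - 1] = counter.get();      // capture the value, as the mocked call would
                if (run == 1) {
                    counter.incrementAndGet();         // second inc(), now inside the mocking code
                } else {
                    latch.countDown();                 // both data points have been captured
                }
            }, 0, 10, TimeUnit.MILLISECONDS);

            // The test thread resumes only after the task has run twice.
            if (!latch.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("scheduled task did not run twice in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the latch", e);
        } finally {
            scheduler.shutdownNow();
        }
        return results;
    }

    public static void main(String[] args) {
        long[] results = captureTwice();
        System.out.println(results[0] + ", " + results[1]);
    }
}
```

Because the increment happens inside the first run and periodic runs never overlap, the second run always observes the incremented value, so the two slots hold 1 and 2 regardless of when the waiting thread is scheduled.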

Member

@tjquinno tjquinno left a comment

Sorry, more changes (but at least they simplify things). See the other longer comment.

@klustria
Member Author

klustria commented Sep 19, 2022

I agree with the feedback and have implemented it. I ran the test again for around 800 iterations and the issue did not reproduce.

@klustria klustria requested a review from tjquinno September 19, 2022 07:12
@klustria
Member Author

The goal of this PR is to resolve the issue reported in #4813

Member

@tjquinno tjquinno left a comment

LGTM

@klustria klustria merged commit a18a856 into helidon-io:helidon-2.x Sep 20, 2022
klustria added a commit to klustria/helidon that referenced this pull request Jan 11, 2023
klustria added a commit that referenced this pull request Jan 13, 2023
* Add Helidon Metrics integration with OCI

The change includes the following:
1. Port PR #4003, which adds the Helidon Metrics to OCI integration.
2. Port PR #4897, which fixes a race condition in the unit test.
3. Adjust the code to deal with MP Metrics API changes.
4. Change the endpoint to the ingestion endpoint when posting metrics, because the OCI SDK integration no longer handles this after changes in the OCI Java SDK v3.
5. Mock Monitoring.class instead of MonitoringClient.class with Mockito in the unit test, because the OCI Java SDK v3 made some of the methods in MonitoringClient final, which makes them difficult to mock.
6. Trim the OCI metadata value that contains the metric description if the value exceeds 256 characters; otherwise it will fail.
7. The OCI Monitoring service allows a maximum of 50 metrics per posting, so additional configuration parameters were added to control sending metrics in batches. The configuration parameters are:
   a. batchSize - Maximum number of metrics to send in a batch. Defaults to 50, which is the maximum OCI allows.
   b. batchDelay - Interval between batch postings.
   For example, if there are 51 metrics, batchSize is set to 25, and batchDelay is set to 5 seconds, the Helidon metrics integration module divides the posting into 3 batches: 25 metrics in each of the 1st and 2nd batches and 1 metric in the 3rd batch, with a 5-second interval between batch postings.
8. Refactor OciMetricsCdiExtension to add a new bean (OciMetricsBean) that handles the observer method into which Monitoring is injected. The previous OciMetricsCdiExtension code could not independently handle instantiation of the Monitoring client via CDI.
9. Add unit tests to verify the batch posting feature and the use of the ingestion endpoint.
10. Add io.helidon.config.Config as a parameter of the OciMetricsBean observer method so that it can be injected.
11. Various changes based on review feedback: fix dependencies, remove the use of streams on lists, execute rule.onNewWebserver only if enabled, add a default value to @ConfigProperty, and validate that the builder methods' parameters are not null.
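
The batching arithmetic in item 7 and the trimming in item 6 can be illustrated with a small standalone sketch. It is not the Helidon implementation: the class BatchingSketch, the methods partition and trimDescription, and the constant MAX_DESCRIPTION_LENGTH are hypothetical names chosen for the example, and the actual posting and the batchDelay wait are only indicated in a comment.

```java
import java.util.ArrayList;
import java.util.List;

class BatchingSketch {

    // Assumed limit, taken from the 256-character figure in item 6.
    static final int MAX_DESCRIPTION_LENGTH = 256;

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Cuts a description down to the maximum length; shorter values are returned unchanged. */
    static String trimDescription(String description) {
        return description.length() <= MAX_DESCRIPTION_LENGTH
                ? description
                : description.substring(0, MAX_DESCRIPTION_LENGTH);
    }

    public static void main(String[] args) {
        List<Integer> metrics = new ArrayList<>();
        for (int i = 0; i < 51; i++) {
            metrics.add(i);                        // 51 placeholder metrics
        }
        for (List<Integer> batch : partition(metrics, 25)) {
            // A real sender would post the batch here and then wait batchDelay.
            System.out.println(batch.size());      // prints 25, 25, 1 on separate lines
        }
    }
}
```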