Fix 32-bit rollovers and add rounding in iostat metrics #30679

Merged

Conversation

fearful-symmetry
Contributor

What does this PR do?

Not assigning any reviewers yet, still a bit paranoid and testing this one.

This is a fix for: #30480

After wading through some conflicting kernel docs, I discovered that certain fields reported by /proc/diskstats are in fact unsigned 32-bit integers, which means they roll over at a relatively low 4,294,967,295 (about 4.29 billion). If we're not careful, the current - last subtraction in the iostat math wraps around whenever one of these counters has rolled over between samples, producing a huge bogus unsigned value. This PR adds a small wrapper that tries to "fix" a rolled-over 32-bit value based on the prior good value. It also rounds the float values, just to clean up the math.

Why is it important?

This bug can result in sporadic bad data on systems with high IO load or long uptimes.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-7.17 Automated backport to the 7.17 branch with mergify labels Mar 4, 2022
@fearful-symmetry fearful-symmetry self-assigned this Mar 4, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 4, 2022
@elasticmachine
Collaborator

elasticmachine commented Mar 4, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-03-04T22:46:54.119+0000

  • Duration: 132 min 1 sec

Test stats 🧪

Test Results
Failed 0
Passed 42426
Skipped 3714
Total 46140

💚 Flaky test report

Tests succeeded.


Contributor

@belimawr left a comment

I read you're still testing this one, so sorry if I'm stating the obvious.

What about adding a test to CalcIOStatistics that simulates the rollover?
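
A sketch of what such a simulation could look like. Everything here is hypothetical: fixRollover32 stands in for the wrapper under test, and the real test would drive CalcIOStatistics rather than call a helper directly. It assumes the counter wraps at 2^32 and at most once between samples.

```go
package main

import (
	"fmt"
	"math"
)

// fixRollover32 is a hypothetical stand-in for the wrapper under test.
func fixRollover32(current, prev uint64) uint64 {
	if current >= prev {
		return current - prev
	}
	return (math.MaxUint32 - prev) + 1 + current
}

// simulateReadings returns n successive readings of a counter that starts at
// start, grows by step on every sample, and wraps at 2^32 the way a 32-bit
// field would.
func simulateReadings(start, step uint64, n int) []uint64 {
	readings := make([]uint64, n)
	v := start
	for i := range readings {
		readings[i] = v
		v = (v + step) % (math.MaxUint32 + 1)
	}
	return readings
}

// checkDeltas returns an error if any delta between consecutive readings
// differs from the step that was used to generate them.
func checkDeltas(readings []uint64, step uint64) error {
	for i := 1; i < len(readings); i++ {
		if got := fixRollover32(readings[i], readings[i-1]); got != step {
			return fmt.Errorf("sample %d: got delta %d, want %d", i, got, step)
		}
	}
	return nil
}

func main() {
	// Start just below the 32-bit maximum so the third reading has wrapped.
	readings := simulateReadings(4_294_967_000, 200, 4)
	fmt.Println(readings, checkDeltas(readings, 200))
}
```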

@fearful-symmetry fearful-symmetry requested review from a team, cmacknz and belimawr March 4, 2022 18:40
return current - prev
}
// we're at a uint64 if we hit this
if prev > maxUint32 {
Member

If the underlying value is 32 bits, this will never happen?

If it's 64 bits, wouldn't you want to do the same math with math.MaxUint64?

The only time you will need to do any math that depends on the actual integer width is when rollover occurs. Aren't you guaranteed that prev > maxUint32 in that case, and you can switch the max for the calculation to use maxUint64 instead?

Contributor Author

Yah, that felt like a paranoid edge case that I should in theory cover, but I wasn't sure if it made sense to actually bother. I mean, if we actually want to "fix" 64-bit rollover, that function should be on all fields, not just a few.

Member

Fair enough, in that case let's put 32 in the function name so that it's obvious that we are only trying to fix this for 32 bit counters.

returnOrFix32BitRollover or something like that.

Contributor

The safest way to address this would be to change the signature to func(curr, prev uint32) uint64, bit widening within the function. This would also have the advantage of statically removing the possibility of this case.

Member

++ That is much better than just changing the name of the function, or hoping this doesn't happen.

Contributor

It does have consequences that concern me: if the kernel changes the type of these fields in the future, the call will still work but with silently corrupted results. To cover that case you want non-wrap-around integer arithmetic, which requires the full width to be known at call time. That means either (with some ugliness) passing both the 32-bit truncation and the 64-bit original, or using a type conversion helper that signals the unexpected high bits somehow. The calling functions return an error, so this is possible, but with four calls it starts to get unwieldy.

Just thought I should bring these up to avoid sending you down the wrong rabbit hole.


// See https://docs.kernel.org/admin-guide/iostats.html and https://github.com/torvalds/linux/blob/master/block/genhd.c diskstats_show()
func returnOrFixRollover(current, prev uint64) uint64 {
var maxUint32 uint64 = math.MaxUint32 //4_294_967_295 Max value in uint32/unsigned int
Contributor

math.MaxUint32 is an untyped constant, so this should not be necessary.

Contributor Author

Yah, did it as a separate variable more in the hopes of making the logic a little easier to follow.

Contributor

I'm not sure that it does; the point of having these as untyped was exactly to allow this kind of use.


result.AvgRequestSize = size
result.AvgQueueSize = queue
result.AvgAwaitTime = wait
result.AvgRequestSize = common.Round(size, common.DefaultDecimalPlacesCount)
Contributor

Not necessarily a concern, but this article does a nice job of explaining the pitfalls of rounding floats.

Contributor Author

Yep. In this case, that common.Round idiom is everywhere in beats, and it was bugging me to not have it here.

Contributor

Not for here, but it's probably worth looking at changing that; float rounding should really not happen until render time.

@fearful-symmetry fearful-symmetry merged commit ff32f15 into elastic:main Mar 7, 2022
mergify bot pushed a commit that referenced this pull request Mar 7, 2022
* PoC for optional json encoding

* Revert "PoC for optional json encoding"

This reverts commit 3550969.

* try to fix rolled-over values in diskio, add rounding

* use math package, add docs

* name change

* change name, add changelog

(cherry picked from commit ff32f15)
fearful-symmetry added a commit that referenced this pull request Mar 8, 2022
Co-authored-by: Alex K <8418476+fearful-symmetry@users.noreply.github.com>
@thekofimensah

Which version will this be available in? 7.17.0?

@thekofimensah

@fearful-symmetry I looked around and couldn't find this change in the source code of any of the new releases: https://github.com/elastic/beats/releases. Which version can I expect to find this change in?

@fearful-symmetry
Contributor Author

@thekofimensah it should be available in 7.17.2, which will be released at the end of the month.
