core: sync scheduler resolve multi task user issue #6818

Merged
1 commit merged into fluent:master on Apr 13, 2023

Conversation

matthewfala
Contributor

@matthewfala matthewfala commented Feb 9, 2023

This is a fix to the sync core scheduler implemented for the CloudWatch Logs output plugin.

Please see this issue for more information on the fix:
#6849

Signed-off-by: Matthew Fala falamatt@amazon.com


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@edsiper
Member

edsiper commented Feb 10, 2023

Assigned to @leonardo-albertovich for review.

@leonardo-albertovich
Collaborator

Hi @matthewfala, could you please briefly summarize the bug and your approach to this fix so the review process is faster? I'm not familiar with the singleplex mechanism; from what I remember, the idea was that some AWS plugins required this change to be able to use asynchronous networking while retaining strict ordering, but that's as far as my knowledge goes.

Thank you very much.

PettitWesley
PettitWesley previously approved these changes Feb 14, 2023
Contributor

@PettitWesley PettitWesley left a comment


I want to see a public GitHub write-up explaining the issue and why this fix is the best option.

@matthewfala
Contributor Author

matthewfala commented Feb 14, 2023

@leonardo-albertovich,
That is correct! The synchronous scheduler just ensures that only one task runs at a time for an output plugin instance, to maintain strict ordering as you mentioned.
It was written to allow the CloudWatch Logs plugin to use the async network stack (the sync network stack had issues) while keeping network operations synchronous, which is needed for the CloudWatch API.

The Sync Core Scheduler adds a task queue to each out_instance; tasks are placed in that queue instead of being flushed/dispatched directly by the control thread.

The scheduler does not admit a queued task for flush/dispatch until the previous task has completed.
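
To make that concrete, here is a compressed sketch of the queueing behavior described above. Every name and structure below is an illustrative stand-in, not the real Fluent Bit scheduler code:

```c
#include <stddef.h>

/* Illustrative only: one pending-task queue per output instance, with at
 * most one task in flight at a time. The real scheduler uses Fluent Bit's
 * own list and task types. */
struct task {
    struct task *next;
};

struct out_instance {
    struct task *queue_head;   /* oldest queued task */
    struct task *queue_tail;
    int busy;                  /* non-zero while a task is being flushed */
};

/* A task routed to this instance is either dispatched now or queued. */
void instance_submit(struct out_instance *ins, struct task *t,
                     void (*dispatch)(struct task *))
{
    if (ins->busy) {
        /* another task is in flight; wait in line */
        t->next = NULL;
        if (ins->queue_tail) {
            ins->queue_tail->next = t;
        }
        else {
            ins->queue_head = t;
        }
        ins->queue_tail = t;
        return;
    }
    ins->busy = 1;
    dispatch(t);
}

/* When the in-flight task completes, admit the next queued task (if any). */
void instance_task_done(struct out_instance *ins,
                        void (*dispatch)(struct task *))
{
    struct task *next = ins->queue_head;

    if (!next) {
        ins->busy = 0;
        return;
    }
    ins->queue_head = next->next;
    if (!ins->queue_head) {
        ins->queue_tail = NULL;
    }
    dispatch(next);
}
```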

The scheduler had a problem where the task user count was not properly tracked while a task was waiting in the queue for other tasks to complete.

There is a premature task deletion issue when a task is sent to multiple output instances. If there are two output instances, A and B, and A is busy processing some other task while B is not, A will enqueue the task without incrementing the user count. B will dispatch the task, incrementing the user count from 0 to 1, and then decrement it from 1 to 0 on completion. With no users left, the task is automatically deleted.

Later, when A frees up and begins to process the task, there is a segfault because the task has already been freed.
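
A self-contained sketch of that race, under the same caveat: the struct and helpers are simplified stand-ins, and only the user-count accounting idea mirrors the real code:

```c
#include <stdio.h>
#include <stdlib.h>

struct task {
    int users;   /* output instances actively working on the task */
    int freed;   /* set once the task has been destroyed */
};

static void task_users_inc(struct task *t)
{
    t->users++;
}

static void task_users_dec(struct task *t)
{
    t->users--;
    if (t->users == 0) {
        /* no users left: the task is destroyed */
        t->freed = 1;
        printf("task destroyed\n");
    }
}

int main(void)
{
    struct task *t = calloc(1, sizeof(*t));

    /* Instance A is busy, so the task is only queued.
     * Bug: the user count is NOT incremented here.
     * The fix described above is to count the queued instance as a user,
     * i.e. call task_users_inc(t) at enqueue time. */

    /* Instance B is idle and dispatches the task right away. */
    task_users_inc(t);   /* users: 0 -> 1 */
    task_users_dec(t);   /* users: 1 -> 0, task destroyed */

    /* Later, A dequeues the task and tries to flush it... */
    if (t->freed) {
        printf("A would now touch freed memory -> segfault\n");
    }

    free(t);
    return 0;
}
```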

I'm writing an issue to describe more about the solution.

@leonardo-albertovich
Collaborator

The explanation is clear and seems to make sense to me. I'd be lying if I said I had a clear picture of everything in that part of the code, but I trust that you have properly done the RCA and tested your patch as usual.

One thing I noticed while working on PR #6852 is that the users field in the flb_task structure is a signed integer, and neither flb_task_users_inc nor flb_task_users_dec checks that it stays within bounds. In addition, flb_task_users_release only accepts zero as a valid signal that the task is ready to be disposed of (which is correct from a theoretical standpoint and less prone to cause illegal memory accesses, in exchange for leaving the door open to leaving some chunks behind that would be re-sent when the system is restarted, but only in "catastrophic" situations).

I don't think it's a huge issue, but I thought I'd mention it to get some additional eyes on it. Adding a conditional to both of those shouldn't have a dramatic effect on performance, and logging an error when we detect this could help us find some rare and nasty bugs (if they exist).
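
A minimal sketch of that kind of check, using simplified stand-ins for flb_task and its helpers rather than the real Fluent Bit API:

```c
#include <stdio.h>

/* Sketch only: the struct and function below are simplified stand-ins,
 * not the actual flb_task / flb_task_users_dec implementation. */
struct flb_task_sketch {
    int users;   /* signed counter, as noted above */
};

int task_users_dec_checked(struct flb_task_sketch *task)
{
    if (task->users <= 0) {
        /* Reporting this instead of silently underflowing would surface
         * accounting bugs like the one fixed by this PR. */
        fprintf(stderr, "[task] users counter would underflow (users=%i)\n",
                task->users);
        return -1;
    }
    task->users--;
    return task->users;
}

int main(void)
{
    struct flb_task_sketch t = { .users = 0 };

    if (task_users_dec_checked(&t) == -1) {
        /* error path taken instead of decrementing to -1 */
        printf("underflow detected and rejected\n");
    }
    return 0;
}
```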

@matthewfala
Contributor Author

Thank you for your review and comments, @leonardo-albertovich.

@matthewfala
Contributor Author

@edsiper, would it be possible to get these both merged in?
Master: #6818
1.9: #6817

It affects CloudWatch stability.

If you want to wait until we have a list of final PRs for 1.9 before merging them in, please let me know and we can hold off until then.

@leonardo-albertovich
Collaborator

If these PRs are to be merged (which I think is the case) and the same bug is present in 2.0, I think it would be important to backport it and merge all three at the same time. Could you do that so we can wrap this up ASAP?

@matthewfala
Contributor Author

Hi Leonardo, I realized recently that this PR is still pending merge. We've been cherry-picking it into our aws-for-fluent-bit container builds for the last two major versions, so I think it's relatively stable.

Here are the other related PRs:
master: this PR
2.0: #7170
1.9: #6817

Not sure about the 1.9 backport. We can close it if needed and continue cherry-picking for our builds.

@matthewfala
Contributor Author

@leonardo-albertovich

@leonardo-albertovich leonardo-albertovich merged commit ed67f03 into fluent:master Apr 13, 2023
@leonardo-albertovich
Collaborator

I just re-scheduled a test in the 2.0 backport that failed for unrelated reasons; once CI is done, I'll merge it.
