-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: sync scheduler resolve multi task user issue #6818
core: sync scheduler resolve multi task user issue #6818
Conversation
assigned to @leonardo-albertovich for review. |
Hi @matthewfala, could you please briefly summarize the bug and your approach to this fix so the review process is faster? I'm not familiar with the singleplex mechanism, from what I remember the idea was that some AWS plugins required this change to be able to use asynchronous networking while retaining strict ordering but that's as far as my knowledge goes. Thank you very much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to see a public GitHub write explaining the issue and why this fix is the most ideal option.
Signed-off-by: Matthew Fala <falamatt@amazon.com>
1eea924
to
61b320c
Compare
@leonardo-albertovich, The Sync Core Scheduler adds a task queue to each out_instance where tasks are queued to instead of being flushed/dispatched off the control thread. The scheduler does not admit a queued task for flush/dispatch until the previous task has completed. The scheduler had a problem where the task user count was not properly tracked while the task is waiting for other tasks to complete in the task queue. There is a premature task deletion issue when a task is sent to multiple output instances. If there are two output instances A and B and A is busy processing some other task but B is not busy, A will enqueue the task and not increment the user count. B will dispatch the task, incrementing user count from 0 to 1, and then on completion, the user count will decrement from 1 to 0. This will trigger the task to be automatically deleted due to there being no users. Later, when A frees up and begins to process the task, there is a segfault because the task is already freed. I'm writing an issue to describe more about the solution. |
The explanation is clear and it seems to make sense to me. I'd be lying if I said I have a clear picture of everything in that part of the code but I trust you have properly done the RCA and tested your patch as usual. One thing I noticed while working on PR #6852 is that the I don't think it's a huge issue but I thought I'd mention it to get some additional eyes on it, I don't think adding a conditional in both of those would have a dramatic effect on performance and if we add an error message in cases where we detect this it could help us find some rare and nasty bugs (if they exist). |
Thank you for your review and comments, @leonardo-albertovich |
If these PRs are to be merged (which I think is the case) and the same bug is present in 2.0 I think it would be important to backport it and merge all three at the same time, could you do that so we can wrap this up asap? |
Hi Leonardo, I realized recently that this PR is still pending merge. We've been cherry picking it into our aws-for-fluent-bit container builds for the last 2 major versions, so I think it's relatively stable. Here's the other related PRs: Not sure about the 1.9 backport. We can close it if needed and continue cherrypicking for our builds. |
I just re-scheduled a test that failed due to unrelated reasons in the 2.0 backport and once CI is done I'll merge it. |
This is a fix to the sync core scheduler implemented for the Cloudwatch Logs output plugin.
Please see this issue for more information on the fix:
#6849
Signed-off-by: Matthew Fala falamatt@amazon.com
Enter
[N/A]
in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-test
label to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.