core: sync scheduler resolve multi task user issue #6818

Merged
1 commit merged into fluent:master on Apr 13, 2023

Conversation

matthewfala
Contributor

@matthewfala matthewfala commented Feb 9, 2023

This is a fix to the sync core scheduler implemented for the CloudWatch Logs output plugin.

Please see this issue for more information on the fix:
#6849

Signed-off-by: Matthew Fala falamatt@amazon.com


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@edsiper
Member

edsiper commented Feb 10, 2023

Assigned to @leonardo-albertovich for review.

@leonardo-albertovich
Collaborator

Hi @matthewfala, could you please briefly summarize the bug and your approach to this fix so the review process is faster? I'm not familiar with the singleplex mechanism; from what I remember, the idea was that some AWS plugins required this change to be able to use asynchronous networking while retaining strict ordering, but that's as far as my knowledge goes.

Thank you very much.

PettitWesley
PettitWesley previously approved these changes Feb 14, 2023
Contributor

@PettitWesley PettitWesley left a comment


I want to see a public GitHub write-up explaining the issue and why this fix is the best option.

@matthewfala
Contributor Author

matthewfala commented Feb 14, 2023

@leonardo-albertovich,
That is correct! The synchronous scheduler just ensures that only one task runs at a time for an output plugin instance, to maintain strict ordering as you mentioned.
It was written to allow the CloudWatch Logs plugin to use the async network stack (the sync network stack had issues) while keeping network operations synchronous, which is needed for the CloudWatch API.

The Sync Core Scheduler adds a task queue to each out_instance; tasks are placed in that queue instead of being flushed/dispatched directly by the control thread.

The scheduler does not admit a queued task for flush/dispatch until the previous task has completed.
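
To make that concrete, here is a compressed sketch of the queueing behavior described above. Every name and structure below is an illustrative stand-in, not the real Fluent Bit scheduler code:

```c
#include <stddef.h>

/* Illustrative only: one pending-task queue per output instance, with at
 * most one task in flight at a time. The real scheduler uses Fluent Bit's
 * own list and task types. */
struct task {
    struct task *next;
};

struct out_instance {
    struct task *queue_head;   /* oldest queued task */
    struct task *queue_tail;
    int busy;                  /* non-zero while a task is being flushed */
};

/* A task routed to this instance is either dispatched now or queued. */
void instance_submit(struct out_instance *ins, struct task *t,
                     void (*dispatch)(struct task *))
{
    if (ins->busy) {
        /* another task is in flight; wait in line */
        t->next = NULL;
        if (ins->queue_tail) {
            ins->queue_tail->next = t;
        }
        else {
            ins->queue_head = t;
        }
        ins->queue_tail = t;
        return;
    }
    ins->busy = 1;
    dispatch(t);
}

/* When the in-flight task completes, admit the next queued task (if any). */
void instance_task_done(struct out_instance *ins,
                        void (*dispatch)(struct task *))
{
    struct task *next = ins->queue_head;

    if (!next) {
        ins->busy = 0;
        return;
    }
    ins->queue_head = next->next;
    if (!ins->queue_head) {
        ins->queue_tail = NULL;
    }
    dispatch(next);
}
```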

The scheduler had a problem where the task user count was not properly tracked while a task was waiting in the queue for other tasks to complete.

There is a premature task deletion issue when a task is sent to multiple output instances. If there are two output instances, A and B, and A is busy processing some other task while B is not, A will enqueue the task without incrementing the user count. B will dispatch the task, incrementing the user count from 0 to 1, and then decrement it from 1 to 0 on completion. With no users left, the task is automatically deleted.

Later, when A frees up and begins to process the task, there is a segfault because the task has already been freed.
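
A self-contained sketch of that race, under the same caveat: the struct and helpers are simplified stand-ins, and only the user-count accounting idea mirrors the real code:

```c
#include <stdio.h>
#include <stdlib.h>

struct task {
    int users;   /* output instances actively working on the task */
    int freed;   /* set once the task has been destroyed */
};

static void task_users_inc(struct task *t)
{
    t->users++;
}

static void task_users_dec(struct task *t)
{
    t->users--;
    if (t->users == 0) {
        /* no users left: the task is destroyed */
        t->freed = 1;
        printf("task destroyed\n");
    }
}

int main(void)
{
    struct task *t = calloc(1, sizeof(*t));

    /* Instance A is busy, so the task is only queued.
     * Bug: the user count is NOT incremented here.
     * The fix described above is to count the queued instance as a user,
     * i.e. call task_users_inc(t) at enqueue time. */

    /* Instance B is idle and dispatches the task right away. */
    task_users_inc(t);   /* users: 0 -> 1 */
    task_users_dec(t);   /* users: 1 -> 0, task destroyed */

    /* Later, A dequeues the task and tries to flush it... */
    if (t->freed) {
        printf("A would now touch freed memory -> segfault\n");
    }

    free(t);
    return 0;
}
```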

I'm writing an issue to describe more about the solution.

@leonardo-albertovich
Collaborator

The explanation is clear and seems to make sense to me. I'd be lying if I said I had a clear picture of everything in that part of the code, but I trust that you have properly done the RCA and tested your patch as usual.

One thing I noticed while working on PR #6852 is that the users field in the flb_task structure is a signed integer, and neither flb_task_users_inc nor flb_task_users_dec checks that it stays within bounds. In addition, flb_task_users_release only accepts zero as a valid signal that the task is ready to be disposed of (which is correct from a theoretical standpoint and less prone to cause illegal memory accesses, in exchange for leaving the door open to leaving some chunks behind that would be re-sent when the system is restarted, but only in "catastrophic" situations).

I don't think it's a huge issue, but I thought I'd mention it to get some additional eyes on it. Adding a conditional to both of those shouldn't have a dramatic effect on performance, and logging an error when we detect this could help us find some rare and nasty bugs (if they exist).
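
A minimal sketch of that kind of check, using simplified stand-ins for flb_task and its helpers rather than the real Fluent Bit API:

```c
#include <stdio.h>

/* Sketch only: the struct and function below are simplified stand-ins,
 * not the actual flb_task / flb_task_users_dec implementation. */
struct flb_task_sketch {
    int users;   /* signed counter, as noted above */
};

int task_users_dec_checked(struct flb_task_sketch *task)
{
    if (task->users <= 0) {
        /* Reporting this instead of silently underflowing would surface
         * accounting bugs like the one fixed by this PR. */
        fprintf(stderr, "[task] users counter would underflow (users=%i)\n",
                task->users);
        return -1;
    }
    task->users--;
    return task->users;
}

int main(void)
{
    struct flb_task_sketch t = { .users = 0 };

    if (task_users_dec_checked(&t) == -1) {
        /* error path taken instead of decrementing to -1 */
        printf("underflow detected and rejected\n");
    }
    return 0;
}
```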

@matthewfala
Contributor Author

Thank you for your review and comments, @leonardo-albertovich.

@matthewfala
Contributor Author

@edsiper, would it be possible to get these both merged in?
Master: #6818
1.9: #6817

It affects CloudWatch stability.

If you want to wait until we have a list of final PRs for 1.9 before merging them in, please let me know and we can hold off until then.

@leonardo-albertovich
Collaborator

If these PRs are to be merged (which I think is the case) and the same bug is present in 2.0, I think it would be important to backport it and merge all three at the same time. Could you do that so we can wrap this up ASAP?

@matthewfala
Contributor Author

Hi Leonardo, I realized recently that this PR is still pending merge. We've been cherry-picking it into our aws-for-fluent-bit container builds for the last two major versions, so I think it's relatively stable.

Here are the other related PRs:
master: this PR
2.0: #7170
1.9: #6817

Not sure about the 1.9 backport. We can close it if needed and continue cherry-picking for our builds.

@matthewfala
Contributor Author

@leonardo-albertovich

@leonardo-albertovich leonardo-albertovich merged commit ed67f03 into fluent:master Apr 13, 2023
@leonardo-albertovich
Collaborator

I just re-scheduled a test in the 2.0 backport that failed for unrelated reasons; once CI is done, I'll merge it.
