data_dog: partially revert recent datadog PR to avoid provider ecs segfault #6785

Open
wants to merge 1 commit into
base: 1.9

matthewfala
Contributor

We noticed that the recent Datadog PR triggers a segfault when the provider option is set to ecs. After a lot of investigation, we were unable to find the root cause, but discovered that the segfault occurs during an unrelated network call. That call has nothing to do with the error handling code added in the PR, yet removing that code resolves the segfault.

As a solution, we partially revert the recent Datadog PR that triggers this segfault. The reverted portion is only some simple error handling code that was recently added. We also add the data buffer resize fix from here: #6570

This partially reverts the provider ecs code from the recent Datadog PRs that trigger the segfault:
#5930
#5929

See the segfault reports in aws-for-fluent-bit repo here:
aws/aws-for-fluent-bit#491

Signed-off-by: Matthew Fala <falamatt@amazon.com>


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Signed-off-by: Matthew Fala <falamatt@amazon.com>
@matthewfala matthewfala temporarily deployed to pr February 4, 2023 01:40 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr February 4, 2023 01:41 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr February 4, 2023 01:55 — with GitHub Actions Inactive
@nokute78
Collaborator

nokute78 commented Feb 5, 2023

aws/aws-for-fluent-bit#491 (comment)

[2022/12/07 10:11:00] [error] [src/flb_sds.c:109 errno=12] Cannot allocate memory
[2022/12/07 10:11:00] [error] [plugins/out_datadog/datadog.c:184 errno=25] Inappropriate ioctl for device

I think this issue is caused by the following condition:
https://github.com/fluent/fluent-bit/blob/v1.9.10/plugins/out_datadog/datadog.c#L181-L182
That branch is only taken when byte_cnt is greater than flb_sds_len(remapped_tags), so flb_sds_len(remapped_tags) - byte_cnt is always negative there.
flb_sds_increase then fails.

On current master this was fixed by #6750, which has not been released yet.
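
To make the operand-order problem concrete, here is a minimal sketch. It assumes both values are held as size_t, which may not match the real types in datadog.c, and the two helper functions are hypothetical stand-ins rather than Fluent Bit APIs. Under that assumption, the v1.9.10 order wraps around to a huge value instead of going negative, which would be consistent with the "Cannot allocate memory" line in the log above.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins only; these are not Fluent Bit functions. */

/* Operand order from v1.9.10: current length minus required size.
 * With unsigned operands this wraps around when cur_len < byte_cnt. */
static size_t delta_v1_9_10(size_t cur_len, size_t byte_cnt)
{
    return cur_len - byte_cnt;
}

/* Operand order from the fix: required size minus current length. */
static size_t delta_fixed(size_t cur_len, size_t byte_cnt)
{
    return byte_cnt - cur_len;
}
```

For example, with a current length of 16 and a byte_cnt of 64, delta_fixed yields 48, while delta_v1_9_10 yields SIZE_MAX - 47.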

@@ -179,7 +179,7 @@ static int datadog_format(struct flb_config *config,
             return -1;
         }
     } else if (flb_sds_len(remapped_tags) < byte_cnt) {
-        tmp = flb_sds_increase(remapped_tags, flb_sds_len(remapped_tags) - byte_cnt);
+        tmp = flb_sds_increase(remapped_tags, byte_cnt - flb_sds_len(remapped_tags));
Contributor

Wait, the 2nd arg to flb_sds_increase is the new size, so shouldn't that be byte_cnt? And we shouldn't subtract anything from it?

Comment on lines -78 to +52
-static int dd_remap_container_name(const char *tag_name,
-                                   msgpack_object attr_value, flb_sds_t *dd_tags_buf)
+static void dd_remap_container_name(const char *tag_name,
+                                    msgpack_object attr_value, flb_sds_t dd_tags)
Contributor

Help me understand again: why are we reverting these changes, which are probably still a real fix? Also, if we are going to revert them, why not put that in a separate revert commit? Why the same commit as this new fix?

Contributor Author

I tested the individual parts of the Datadog PR, and it turns out that completely unrelated error handling code was triggering a segfault at a seemingly random place in the code (for example, on a network call).

We decided that this PR would keep only the essential portions of the recent PR and drop the lower-priority ones to avoid introducing the segfault.
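
For readers weighing the two signatures in this hunk, the sketch below illustrates the general C trade-off between them. It is not the Fluent Bit code: buf_t and buf_append are hypothetical names, and it is only an assumption that the int-returning variant was written for this reason. Taking a pointer to the handle lets the callee hand back a buffer that realloc may have moved, and the int return lets the caller see an allocation failure.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string; not the flb_sds_t implementation. */
typedef char *buf_t;

/* Appends src to *buf, growing it with realloc. Returns 0 on success and
 * -1 on allocation failure, in which case *buf is left unchanged. */
static int buf_append(buf_t *buf, const char *src)
{
    size_t old_len = *buf ? strlen(*buf) : 0;
    size_t add_len = strlen(src);
    char *tmp = realloc(*buf, old_len + add_len + 1);

    if (tmp == NULL) {
        return -1;       /* caller still owns the original buffer */
    }
    memcpy(tmp + old_len, src, add_len + 1);  /* copies the terminator too */
    *buf = tmp;          /* publish the possibly moved pointer */
    return 0;
}
```

A void function that receives the handle by value can do neither: the caller keeps its old pointer even if the buffer was moved, and a failed allocation goes unreported.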

@@ -179,7 +179,7 @@ static int datadog_format(struct flb_config *config,
             return -1;
         }
     } else if (flb_sds_len(remapped_tags) < byte_cnt) {
-        tmp = flb_sds_increase(remapped_tags, flb_sds_len(remapped_tags) - byte_cnt);
+        tmp = flb_sds_increase(remapped_tags, byte_cnt - flb_sds_len(remapped_tags));
Contributor Author

This is the data buffer resize fix from here: #6570

@github-actions
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Jun 16, 2023