
fix(userspace/falco): fixed falco_metrics::to_text implementation when running with plugins #3230

Merged: 2 commits, Jun 3, 2024

Conversation

FedeDP
Contributor

@FedeDP FedeDP commented May 31, 2024

What type of PR is this?

/kind bug

Any specific area of the project related to this PR?

/area engine

What this PR does / why we need it:

This PR does 3 things:

  • avoid crashing when scap_machine_info or scap_agent_info are NULL (i.e. on a non-Linux scap platform, such as when running with plugins)
  • properly use a shared_ptr for the inspectors vector, so that the inspectors are guaranteed to stay alive while to_text runs and the pointers can be safely dereferenced
  • only expose metrics for sources that are actually enabled (not merely loaded). Before, in nodriver mode with the syscall source disabled, we would have tried to emit metrics for the syscall source/inspector too, which is wrong since it is not enabled (and it would crash Falco too!)
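
As a rough illustration of the second point, the sketch below stores inspectors in a vector of shared_ptr, so the vector co-owns each inspector for as long as the text is being built. The `fake_inspector` type, the `to_text_sketch` function and the `evts` metric name are hypothetical stand-ins, not Falco's actual code.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an inspector; NOT the real sinsp class.
struct fake_inspector {
	std::string source;
	unsigned long n_evts = 0;
};

// The vector holds shared_ptrs, so each inspector is co-owned by the vector
// and cannot be destroyed while this function is iterating over it.
std::string to_text_sketch(const std::vector<std::shared_ptr<fake_inspector>>& inspectors)
{
	std::string out;
	for (const auto& insp : inspectors) {
		if (!insp) {
			continue; // skip empty slots instead of dereferencing them
		}
		// "evts" is an illustrative metric name only.
		out += "evts{source=\"" + insp->source + "\"} " +
		       std::to_string(insp->n_evts) + "\n";
	}
	return out;
}
```

With raw pointers instead, releasing the original owner before the call would leave the vector dangling; with shared_ptr the vector's copy keeps the object valid.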

Which issue(s) this PR fixes:

Fixes #3229

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix(userspace/falco): fixed `falco_metrics::to_text` implementation when running with plugins

…hen running with plugins.

Signed-off-by: Federico Di Pierro <nierro92@gmail.com>
Contributor

@incertum incertum left a comment


Right, yes: we knew that it had not been working in plugin-only mode since the scap regression.
Thanks @FedeDP for adding the safety checks.

incertum
incertum previously approved these changes May 31, 2024
Contributor

@incertum incertum left a comment


/approve

@incertum
Contributor

@FedeDP do we also check for NULL in the output rule code? If not, maybe we should add some checks there too, given that we have the scap regression.

@incertum
Contributor

yes @FedeDP please also adjust below code 🙏 since we will continue having the scap regression for a while.

void stats_writer::collector::get_metrics_output_fields_wrapper(
		nlohmann::json& output_fields,
		const std::shared_ptr<sinsp>& inspector,
		const std::string& src, uint64_t num_evts,
		uint64_t now, double stats_snapshot_time_delta_sec)
{
	static const char* all_driver_engines[] = {
		BPF_ENGINE, KMOD_ENGINE, MODERN_BPF_ENGINE,
		SOURCE_PLUGIN_ENGINE, NODRIVER_ENGINE, GVISOR_ENGINE };
	const scap_agent_info* agent_info = inspector->get_agent_info();
	const scap_machine_info* machine_info = inspector->get_machine_info();
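
A minimal sketch of the kind of guard being requested here, assuming both getters may return NULL. The two structs are trimmed-down hypothetical stand-ins for scap_agent_info and scap_machine_info, a std::map replaces nlohmann::json to keep the example dependency-free, and the `falco.num_cpus` key is illustrative only.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins; the real structs are defined in falcosecurity/libs
// and carry more fields than shown here.
struct agent_info_sketch { uint64_t start_ts_epoch; };
struct machine_info_sketch { uint32_t num_cpus; };

// Adds a field only when the struct backing it is available, so a NULL
// pointer results in a missing field rather than a crash.
void add_info_fields(std::map<std::string, std::string>& fields,
		const agent_info_sketch* agent_info,
		const machine_info_sketch* machine_info)
{
	if (agent_info != nullptr) {
		fields["falco.start_ts"] = std::to_string(agent_info->start_ts_epoch);
	}
	if (machine_info != nullptr) {
		// Illustrative key name, not taken from this thread.
		fields["falco.num_cpus"] = std::to_string(machine_info->num_cpus);
	}
}
```

Skipping the field is one possible choice; emitting a default value such as 0 would be another.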

@FedeDP
Contributor Author

FedeDP commented May 31, 2024

/hold

@FedeDP
Contributor Author

FedeDP commented May 31, 2024

yes @FedeDP please also adjust below code 🙏 since we will continue having the scap regression for a while.

For some reason it was not failing on that locally :/ Weird, but yep, I'll add the extra checks!

EDIT:

falco.start_ts=0

Mmmh, this is output by the output_rule for the metrics; indeed, I can confirm that both agent_info and machine_info are not NULL here.
Perhaps we are using the wrong inspector over there? Anyway, I'll add the extra check and leave it as-is.

EDIT2: at a quick glance, it seems like we are using the correct inspector :/

Signed-off-by: Federico Di Pierro <nierro92@gmail.com>

Co-authored-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
@incertum
Contributor

indeed, I can confirm that both agent_info and machine_info are not NULL here.

They are not initialized when running Falco with plugins only, see #2821 :/

Contributor

@incertum incertum left a comment


/approve

 std::vector<libs::metrics::libs_metrics_collector> metrics_collectors;

-for (const auto& source_info: state.source_infos)
+for (const auto& source: state.enabled_sources)
Contributor Author


Oh well, it turns out the issue was actually here: the old code also used the syscall inspector, which I guess was not even initialized and thus had NULL agent and machine infos.
That's why stats_writer was not crashing.

I think leaving the checks in is OK anyway: more protection for free :)
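
To make the loaded-versus-enabled distinction concrete, here is a small sketch that selects only enabled sources for export. `state_sketch` and `sources_to_export` are hypothetical names, and the two plain vectors are a simplification of whatever containers the real state uses.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical, trimmed-down state: which sources are loaded vs. enabled.
struct state_sketch {
	std::vector<std::string> loaded_sources;
	std::vector<std::string> enabled_sources;
};

// Returns the sources metrics should be emitted for. Iteration is driven by
// the enabled list, so a loaded-but-disabled source is never visited.
std::vector<std::string> sources_to_export(const state_sketch& s)
{
	std::vector<std::string> out;
	for (const auto& src : s.enabled_sources) {
		// Defensive: ignore names that are enabled but were never loaded.
		const bool loaded = std::find(s.loaded_sources.begin(),
		                              s.loaded_sources.end(),
		                              src) != s.loaded_sources.end();
		if (loaded) {
			out.push_back(src);
		}
	}
	return out;
}
```

With "syscall" loaded but disabled and a plugin source enabled, only the plugin source is returned, so the uninitialized syscall inspector is never touched.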


Contributor

@incertum incertum May 31, 2024


Ahh, understood. Great job spotting this, and agreed: more checks help here and also make it clearer that these two structs can be NULL in some circumstances.

@FedeDP
Contributor Author

FedeDP commented May 31, 2024

/milestone 0.39.0
(will probably become 0.38.1)

@poiana poiana added this to the 0.39.0 milestone May 31, 2024
@FedeDP
Contributor Author

FedeDP commented Jun 3, 2024

/milestone 0.38.1

@poiana poiana modified the milestones: 0.39.0, 0.38.1 Jun 3, 2024
@poiana
Contributor

poiana commented Jun 3, 2024

@sgaist: changing LGTM is restricted to collaborators


@poiana
Contributor

poiana commented Jun 3, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FedeDP, incertum, LucaGuerra, sgaist


Needs approval from an approver in each of these files:
  • OWNERS [FedeDP,LucaGuerra,incertum]


@FedeDP
Contributor Author

FedeDP commented Jun 3, 2024

/unhold

@poiana poiana merged commit 6687d50 into master Jun 3, 2024
33 checks passed
@poiana poiana deleted the fix/metrics_nodriver branch June 3, 2024 13:56

Successfully merging this pull request may close these issues.

Scraping prometheus metrics endpoint crashes falco process
5 participants