Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP POC Event categorization on Linux auth logs #9905

Closed
wants to merge 9 commits into from

Conversation

webmat
Copy link
Contributor

@webmat webmat commented Jan 4, 2019

To me, it's looking like the current Ingest Node processors make this task much too verbose.

General TODO

  • Add more kinds of auth logs (PAM, su, systemd)
  • Refactor syslog header parsing:
    • Now defined once, instead of in each pattern
    • Now always extracts process.name and .pid

Categorization TODO

  • All activity: event.kind:event
  • SSH authentication activity: event.category:authentication, event.action:ssh_login,
    event.outcome:(success|failure)
  • SSH session activity: event.category:session,
    event.action:ssh_session, event.outcome:(connect|disconnect)
  • SSH management activity (e.g. restarts): event.category:?, event.action:?
  • User & group management activity
  • sudo activity
  • su activity
  • PAM activity

@webmat webmat requested review from a team as code owners January 4, 2019 20:36
@webmat webmat self-assigned this Jan 4, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/secops

@webmat webmat added the in progress Pull request is currently in progress. label Jan 4, 2019
@webmat
Copy link
Contributor Author

webmat commented Jan 4, 2019

Based on the work in this proof of concept, we'll need a much more terse way to define these rules:

  • developer sanity: very few logs are categorized right now, and this has already required a lot of brittle code to be put in place
  • Ingest Node performance limitations: all of the if clauses are in Painless, and will count towards increasing the likelihood of causing script.max_compilations_rate errors.
    • We're already facing problems in CI with one Painless script and perhaps 1-2 if clauses per module. Going through with this approach will increase the Painless script count by more than an order of magnitude, I think. This will not be viable for CI, much less for production environments.
    • See Parameterizing Painless script #9821 for more information about the script.max_compilations_rate errors.

Some things that could help:

  • An if processor that takes an array of processor tasks (same semantics as the array under "processors":)
  • The ability to set multiple field values, when a grok is successful. Similar semantics to Logstash' add_field common option

@urso urso removed the request for review from a team January 5, 2019 15:02
@tsg
Copy link
Contributor

tsg commented Jan 6, 2019

Thanks for putting this together @webmat.

Based on the work in this proof of concept, we'll need a much more terse way to define these rules

Looking at the final pipeline, it looks reasonably short and actually quite readable.

There are for sure ways to improve the developer experience, and we should brainstorm about them, but based on the above I wouldn't say this is unfeasible in the current form. Could you explain perhaps a bit more what you found to be terrible?

Regarding the max_compilations_rate, we should discuss with the ES team if it might make sense to raise that soft limit for everyone in 7.0, but we can also have Beats raise it per cluster during setup, or something similar. It also feels like something more likely to affect CI rather than production, where every script is only compiled once. CC @jakelandis.

@webmat
Copy link
Contributor Author

webmat commented Jan 7, 2019

@tsg The reason I say it's not reasonable is that so far this PR so far is only doing categorization on 1/7 types of messages that I see in this module (there may be more than 7). Moreover, it's not doing so properly, as it incorrectly makes the assumption that anything SSH is about authentication attempts. But SSH messages can also contain admin stuff (restarts) and sessions (connects / disconnects). Given all this, I think the amount of code necessary is prohibitive *.

Another part that bugs and makes things brittle me is how each processor has to replicate a full if statement with a lot of duplicate checks (repeating other processor's similar if).

I'll work on the PR some more this week, to continue fleshing out the POC. I think it will better showcase the problems I see.

Note that we could take a completely different approach as well, and go with one single bigger Painless script. Multiline, indentation, etc. But we don't have good support for this in Beats yet.

* I did start this PR by doing a pretty good refactoring that extracts out the parsing of the syslog header. It used to be done by each grok pattern. Even if we end up dropping this POC, I'll extract the refactoring and submit it on its own. This refactoring is working against me making the case that this generates too much code, since I started with a code reduction ;-)

@webmat
Copy link
Contributor Author

webmat commented Jan 7, 2019

When evaluating this code, make sure you scroll right. Check out line 71 ;-)

@webmat
Copy link
Contributor Author

webmat commented Jan 7, 2019

Another note that's unrelated to Ingest Node verbosity is that we'll need to do de-duplication as well. In many cases, a single real world event (e.g. successful login) can translate to 3-4 log entries or more. See this successful login. We'll need to de-duplicate this, to reduce noise.

This won't be done with Ingest Node, but in the current plan, I'm not sure what would do this.

This comment is more of a note for later than a problem to be addressed in this POC, though.

@jakelandis
Copy link

Regarding the max_compilations_rate, we should discuss with the ES team if it might make sense to raise that soft limit for everyone in 7.0

I think the real is issue elastic/elasticsearch#37120. I doubt (but haven't looked) that you actually have that 75 scripts/templates that need to compile. I will look into getting that issue resolved very soon. I would prefer to not raise the limit unless we know we the limit will be hit after elastic/elasticsearch#37120 is resolved.

@webmat
Copy link
Contributor Author

webmat commented Jan 8, 2019

@jakelandis Ok, thanks for bringing this one up. We've bumped the Beats CI pipeline to 1000 compilations / minute for now 😆

I'll make sure nobody tries to "fix" any of the script and let them know about this issue

@webmat webmat mentioned this pull request Jan 10, 2019
13 tasks
Mathieu Martin added 9 commits January 14, 2019 11:06
Side effect is that now, all of these patterns now correctly populate `process.name`
- set event.kind (trivial)
- set event.category to authentication for all sshd activity
- move event.action to system.auth.ssh.action, to make room for normalized event.action
… ssh activity

Note: when I say event.action is too broad, this is because it lumps in
session disconnects as 'ssh_login', when it should likely be ssh_session.
tsg added a commit to tsg/beats that referenced this pull request Mar 20, 2019
This PR adds the following fields for the SSH login events:

* `event.category: authentication`
* `event.action: ssh_login`
* `event.type` either `authentication_success` or `authentication_failure`

The `event.outcome` is currently not quite ECS compliant, but I didn't touch it to
avoid a breaking change.

The PR doesn't attempt to categorize other logs besides the SSH login attempts,
so it's a subset of elastic#9905, but it's what we need for the UI.
tsg added a commit that referenced this pull request Mar 21, 2019
* Adding categorization fields for the system/auth module

This PR adds the following fields for the SSH login events:

* `event.category: authentication`
* `event.action: ssh_login`
* `event.type` either `authentication_success` or `authentication_failure`

The `event.outcome` is currently not quite ECS compliant, but I didn't touch it to
avoid a breaking change.

The PR doesn't attempt to categorize other logs besides the SSH login attempts,
so it's a subset of #9905, but it's what we need for the UI.

* Normalized event.outcome and brought back `system.auth.ssh.event`.

* changelog
tsg added a commit to tsg/beats that referenced this pull request Mar 21, 2019
* Adding categorization fields for the system/auth module

This PR adds the following fields for the SSH login events:

* `event.category: authentication`
* `event.action: ssh_login`
* `event.type` either `authentication_success` or `authentication_failure`

The `event.outcome` is currently not quite ECS compliant, but I didn't touch it to
avoid a breaking change.

The PR doesn't attempt to categorize other logs besides the SSH login attempts,
so it's a subset of elastic#9905, but it's what we need for the UI.

* Normalized event.outcome and brought back `system.auth.ssh.event`.

* changelog

(cherry picked from commit a9f567b)
tsg added a commit that referenced this pull request Mar 21, 2019
…m/auth module (#11363)

* Adding categorization fields for the system/auth module (#11334)

* Adding categorization fields for the system/auth module

This PR adds the following fields for the SSH login events:

* `event.category: authentication`
* `event.action: ssh_login`
* `event.type` either `authentication_success` or `authentication_failure`

The `event.outcome` is currently not quite ECS compliant, but I didn't touch it to
avoid a breaking change.

The PR doesn't attempt to categorize other logs besides the SSH login attempts,
so it's a subset of #9905, but it's what we need for the UI.

* Normalized event.outcome and brought back `system.auth.ssh.event`.

* changelog

(cherry picked from commit a9f567b)

* cleanup changelog
@exekias exekias removed the request for review from a team November 22, 2019 10:56
@andrewkroh andrewkroh closed this Aug 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ecs Filebeat Filebeat in progress Pull request is currently in progress. module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants