
Add a log receiver on CAPI MCs to be able to receive teleport audit logs and send them to Loki and customers' SIEMs #3343

Open
QuentinBisson opened this issue Mar 21, 2024 · 19 comments

@QuentinBisson

QuentinBisson commented Mar 21, 2024

Towards #3250

Motivation

As Giant Swarm employees will no longer be able to SSH onto nodes and clusters using bastion hosts but will use Teleport instead, we need to be able to collect and store the Teleport logs in Loki and potentially in customers' SIEMs.

To that end, and after some discussions with @giantswarm/team-bigmac #3250 (comment), we decided that we would add a fluent-bit log receiver on the management clusters that would receive the Teleport logs and send the data wherever it is needed (Loki, and a SIEM if needed).

Acceptance Criteria

  • Given audit logs received from Teleport, the logs are ingested into Loki

Dependencies


Tasks

@Rotfuks
Contributor

Rotfuks commented Oct 15, 2024

From @TheoBrigitte in the epic level:

Regarding Teleport audit events, there is a way to export events from our Teleport cloud and ship them into Fluentd (or probably any other JSON-compatible log ingester ... Alloy 👀), but exported events are from all installations and I haven't found a way to filter events on the Teleport side.

I did follow the Export Events with Fluentd guide, and used the identity file from kg get secret abc-identity-output -oyaml | yq -r .data.identity | base64 -d > identity.
We use Teleport cloud, meaning our Auth service is hosted at teleport.giantswarm.io, so events are produced at the cloud level. The teleport-event-handler would then connect to the Auth service and retrieve the events it contains.

Useful links:

https://goteleport.com/docs/admin-guides/management/export-audit-events/fluentd/
https://goteleport.com/docs/reference/monitoring/audit/#events
https://goteleport.com/docs/reference/access-controls/roles/#filter-fields

@Rotfuks
Contributor

Rotfuks commented Nov 26, 2024

  • We can already receive teleport audit logs with the log receiver we created; we just need to configure it
  • We can already send data to any customer SIEM that supports OTLP

So in the scope of this ticket what's missing is:

  • Ingest Teleport audit logs
  • Check with spyros whether he can help here :)

@TheoBrigitte
Member

I have some security concerns about this, as the current solution would expose all Audit Events in our Teleport account to every installation, meaning that any customer could potentially get access to any other customer's Audit Events. The filtering we can do in fluentd, alloy, or whatever we use would happen too late and could easily be worked around to get access to all Audit Events.
I opened an issue at Teleport to see if they have an idea on this topic: gravitational/teleport#49582

@QuentinBisson
Author

@TheoBrigitte I think there's some confusion here. The goal of this story is to enable the log receiving endpoint, but the export to the correct installation (filtering of the correct logs) will be done in the teleport cluster by @giantswarm/team-shield. We only need to provide a secure endpoint for access from the teleport cluster. This will most likely require cooperation from @ssyno as we should be able to restrict this ingress only to the teleport VNET?

@TheoBrigitte
Member

Teleport Audit Events will be ingested into Giant Swarm's teleport cluster, the filtering will happen there, and Audit Events will then be dispatched to the different installations.

@ssyno

ssyno commented Dec 3, 2024

The audit events already exist inside our teleport cluster; we just have to figure out a way to ship them to the log storage of each management cluster (so we can avoid our customers interacting with our teleport cluster). Teleport has a plugin called teleport-plugin-event-handler, which is supposed to work with fluentd. We started working on this in the past, but it has been stale for a while now.

@QuentinBisson QuentinBisson self-assigned this Dec 9, 2024
@QuentinBisson
Author

@giantswarm/team-shield coming back to this :)

From what I understood, we need to create a teleport event handler for each installation (which is basically a fluentd), and this requires you to work on it.

Each teleport event handler would need to be able to forward logs (in loki format) to our managed Loki per installation.

The rough schema would be:

teleport event handler -> fluentd forwarder -> observability gateway -> loki

Now my questions are:

  • How do we secure the log forwarding from fluentd to the observability gateway? Can we use some kind of teleport VNET? The gateway is currently behind one ingress that supports OIDC based on dex, but I imagine we can create another ingress for teleport only? @stone-z I assume you have thoughts :D
  • Where would the fluentd config live so we could work on crafting it together?

@TheoBrigitte
Member

The teleport event handler basically spits out JSON; anything that understands JSON should be able to ingest the data coming from the event handler.

I see this a bit differently: shield sets up the teleport event handler, which then talks to "something" atlas provides, which might be Alloy or Fluentd (we have to verify this). So this would be

teleport event handler -> alloy or fluentd -> loki

I think the alloy/fluentd config is then on us (atlas).

@QuentinBisson
Author

Yes, the event handler would communicate with the observability gateway, but the event handler does not support loki yet, only fluentd, and it should run on the teleport cluster to filter per installation

@ssyno

ssyno commented Dec 10, 2024

I think we can skip the teleport-event-handler step; we should be able to scrape logs directly from the teleport cluster with alloy or promtail and ship them to the right loki

@QuentinBisson
Author

Do you have the handler running on the teleport cluster yet?

@ssyno

ssyno commented Dec 10, 2024

nope, I think we should try doing it without the event-handler

@QuentinBisson
Author

Do you want us to have a quick meet on thursday to figure out what we can do?

@QuentinBisson
Author

@ssyno and I had a talk. We plan to:

  • Shield will output audit events / logs in a file on the teleport cluster
  • Shield will create a teleport VNET to open connections from the teleport cluster to all observability gateways (certificate authentication is managed by teleport so we do not need API keys on loki)
  • Atlas will deploy alloy on the teleport cluster through terraform
    • Alloy will filter out logs per installation and send them to the right endpoints based on the log content (see the sketch at the end of this comment)

Open questions:

  • How do we handle the alloy config dynamically (one loki.write section per MC)? Do we add a proxy? Do we keep a static list of MCs in the teleport cluster? What is the source of the list of MCs in the teleport cluster?
  • How can we access the teleport cluster? (Yes, we can.)
  • How do we handle monitoring of this?

Investigation:

  • Let's try to run this on golem (have audit events in a file and vnet) with a manual alloy deployment

@giantswarm/team-shield and @giantswarm/team-atlas what are your thoughts on this solution?
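
To make the Alloy side a bit more concrete, here is a rough sketch of what the per-installation filtering could look like. This assumes the audit events land as JSON lines in a file (the path is a placeholder), that each event carries a kubernetes_cluster field identifying the target installation (field name to be confirmed), and uses golem as the example MC; the gateway URL is a placeholder as well, nothing here is final:

```
// Tail the audit events file Shield would write on the teleport cluster (placeholder path).
loki.source.file "teleport_audit" {
  targets    = [{__path__ = "/var/lib/teleport/log/events.log", job = "teleport-audit"}]
  forward_to = [loki.process.golem.receiver]
}

// One pipeline per installation: extract the cluster name, keep only events
// for this installation, and forward them to its observability gateway.
loki.process "golem" {
  stage.json {
    expressions = { kubernetes_cluster = "" }
  }
  stage.labels {
    values = { kubernetes_cluster = "" }
  }
  // Drop everything that does not belong to golem.
  stage.match {
    selector = "{kubernetes_cluster!=\"golem\"}"
    action   = "drop"
  }
  forward_to = [loki.write.golem.receiver]
}

loki.write "golem" {
  endpoint {
    // Placeholder: the actual per-MC observability gateway endpoint is still an open question.
    url = "https://<observability-gateway-golem>/loki/api/v1/push"
  }
}
```

Answering the dynamic-config question above would then mostly mean templating one loki.process/loki.write pair per MC from whatever list of MCs we settle on.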

@stone-z
Contributor

stone-z commented Dec 12, 2024

This sounds good to me 👍

It would be nice to dogfood and use the same route that customers use (including ingress), but the vnet approach also makes sense and would handle private environments as well, so no blocker from me.

Some suggestions / considerations:

  • We must have very high confidence that we are sending only the correct logs to each MC. Ideally there are some tests for that.
  • The log volume for teleport logs alone is likely not that high, but each service that runs through the VNET adds to the demand on those components. This would be a good time to double-check that we have observability of their performance, to make sure teleport or MC components don't fall over and lock us out.

@QuentinBisson
Author

We were mostly concerned about private MCs when we talked about the vnet, and I'm not sure we could provide OIDC for the log shipper without SPIRE anyway :)
I'm considering whether our agent should not also use the vnet route, but that's a whole other discussion.

I'm totally aligned with your other points

@QuentinBisson
Author

Hey @ssyno do you think we could kickstart this on your side this month?

@ssyno

ssyno commented Jan 7, 2025

I did some research over the past few days:
Audit logs are available inside the teleport-auth pods, in this format:

2025-01-07T14:55:32Z INFO [AUDIT]     kube.request addr.remote:10.0.3.115:43858 cluster_name:teleport.giantswarm.io code:T3009I ei:0 event:kube.request kubernetes_cluster:golem kubernetes_groups:[system:masters system:authenticated] kubernetes_users:[system:serviceaccount:default:automation] login:ssyno namespace:default proto:kube request_path:/apis/apiextensions.k8s.io/v1/customresourcedefinitions resource_api_group:apiextensions.k8s.io/v1 resource_kind:customresourcedefinitions response_code:200 server_hostname:teleport.giantswarm.io server_id:6ea64372-dd29-4018-b61b-2c343ee714d6 server_version:16.1.7 sid: time:2025-01-07T14:55:32.936Z uid:3e4f8dee-f5ed-463d-a9d7-a2d2c05c5efc user:ssyno user_kind:1 verb:GET events/emitter.go:288

I have enabled Teleport VNet for loki-write in Golem at loki-golem.teleport.giantswarm.io
https://github.com/giantswarm/giantswarm-management-clusters/commit/e8443989981bb69b58a05da7706cf345762dcd27

I think we are ready to move forward
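
If Alloy ends up tailing these teleport-auth lines directly rather than a JSON events file, the routing could look roughly like this. This is only a sketch: it assumes the [AUDIT] key:value format above stays stable, the stage names and the push path are assumptions, and the host is the VNet endpoint from the commit above:

```
loki.process "audit_golem" {
  // Lift the target cluster out of the key:value audit line into a label.
  stage.regex {
    expression = "kubernetes_cluster:(?P<kubernetes_cluster>\\S+)"
  }
  stage.labels {
    values = { kubernetes_cluster = "" }
  }
  // Drop anything that is not a golem audit event (lines without the field
  // get no label and are dropped as well).
  stage.match {
    selector = "{kubernetes_cluster!=\"golem\"}"
    action   = "drop"
  }
  forward_to = [loki.write.golem.receiver]
}

loki.write "golem" {
  endpoint {
    url = "https://loki-golem.teleport.giantswarm.io/loki/api/v1/push"
  }
}
```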

@QuentinBisson
Author

Awesome, so as long as we only restrict log collection to the teleport pods and filter based on the kubernetes cluster, we should be good to go :) I'll do some testing on the teleport cluster.
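
For the "only the teleport pods" part, a minimal sketch of the discovery side (the namespace and pod label used here are assumptions about how the teleport cluster is deployed), feeding the processing pipeline sketched above:

```
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "teleport" {
  targets = discovery.kubernetes.pods.targets

  // Keep only pods from the teleport namespace (assumed name).
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    regex         = "teleport"
    action        = "keep"
  }
  // Keep only the teleport-cluster pods (assumed label).
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
    regex         = "teleport-cluster"
    action        = "keep"
  }
}

loki.source.kubernetes "teleport" {
  targets    = discovery.relabel.teleport.output
  forward_to = [loki.process.audit_golem.receiver]
}
```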
