feat: refactor cni telemetry #3149

Open. QxBytes wants to merge 17 commits into base: master

@QxBytes QxBytes commented Nov 14, 2024

Reason for Change:

Currently, the telemetry that CNI sends is insufficient to debug CNI issues. This PR refactors the CNI telemetry so that it can send more, and better-quality, logs.

  • Moves telemetry into a package-level variable so that it is accessible everywhere
  • Stops sending certain metrics because they are unused
  • Sets the subcontext to the container ID. The container ID stays consistent across CNI calls for the same pod, so an ADD and a DEL call (and all related logs) for the same pod share the same subcontext/container ID. The container ID is also one of the keys stored in stateless mode.
  • Sets the operation ID before any telemetry events are sent. The operation ID would be used for sampling, should we end up enabling it.
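
For illustration, the bullets above could look roughly like the sketch below. This is a minimal sketch and not the code in this PR: `Client`, `Telemetry`, `SetContext`, and the in-memory `sent` slice are hypothetical names chosen for the example.

```go
package main

import "fmt"

// Client is a hypothetical stand-in for the CNI telemetry client; the
// field and method names are illustrative, not the PR's API.
type Client struct {
	subcontext  string   // container ID, shared by ADD and DEL for one pod
	operationID string   // could drive sampling if it is ever enabled
	sent        []string // stands in for the real telemetry transport
}

// Telemetry is package-level so that any code path can emit events.
var Telemetry = &Client{}

// SetContext records the identifiers that every later event carries.
func (c *Client) SetContext(containerID, operationID string) {
	c.subcontext = containerID
	c.operationID = operationID
}

// SendEvent tags the message with the current context and queues it.
func (c *Client) SendEvent(msg string) {
	c.sent = append(c.sent, fmt.Sprintf("[%s][%s] %s", c.subcontext, c.operationID, msg))
}

func main() {
	// Set the identifiers first, then send events.
	Telemetry.SetContext("container-abc", "op-1")
	Telemetry.SendEvent("ADD command started")
	fmt.Println(Telemetry.sent[0])
}
```

Because the subcontext is the container ID, events from an ADD and a later DEL for the same pod can be correlated by filtering on that one value.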

Examples of logged information (will be added in a separate PR; this PR is focused on refactoring)

  • CNI add network configuration, arguments
  • CNI add completion with endpoint info struct information (contains hns endpoint id and hns network id), interface results from the ipam invoker, and any error that occurred
  • CNI del network configuration, arguments
  • CNI del completion with error that occurred
  • HNS Endpoint struct before creation / HNS Endpoint Id during deletion
  • HNS Network struct before creation / HNS Network Id during deletion
  • Deletion/release of each IP (even if it does not exist)
  • Mapping sent to CNS in stateless CNI mode during Update Endpoint State
  • Exact CNS response from CNS ipam invoker
  • Exact CNS response from multitenancy ipam invoker
  • Transparent VLAN: creation/deletion of the VLAN veth interface

Potential additions:

  • endpoint and network structs saved to azure-vnet.json statefile

Issue Fixed:

Requirements:

Notes:
Pipeline run to prove logs sent to kusto: https://msazure.visualstudio.com/One/_build/results?buildId=108208651&view=results
Passing run: https://msazure.visualstudio.com/One/_build/results?buildId=108563465&view=results

@QxBytes QxBytes changed the title from "ci: refactor cni telemetry" to "feat: refactor cni telemetry" Nov 14, 2024
@QxBytes QxBytes self-assigned this Nov 14, 2024
@QxBytes QxBytes added the cni (Related to CNI.), ci (Infra or tooling.), telemetry, and logging labels Nov 14, 2024
@QxBytes QxBytes marked this pull request as ready for review November 14, 2024 23:56
@QxBytes QxBytes requested review from a team as code owners November 14, 2024 23:56

func (c *TelemetryClient) SendEvent(msg string) {
	c.sendLog(msg)
	c.sendTelemetry(msg)
Contributor

what is the difference between log and telemetry here?

Contributor Author

Logs are sent to azure-vnet.log. Telemetry is sent to the telemetry service on the node, which then forwards it to Application Insights.

}
}

func (c *TelemetryClient) ConnectTelemetry(logger *zap.Logger) {
Contributor

why can't this be part of the constructor? wouldn't that remove the need for nil checks?

Contributor Author

If we called ConnectTelemetry in NewTelemetryClient, I believe we would try to connect to the telemetry service during tests, which I do not want. Connect / StartAndConnect are also called by different main methods (stateless vs. non-stateless CNI), whereas NewTelemetryClient is called for both. Adding a connect-only attempt to NewTelemetryClient (via ConnectTelemetry) would be incorrect for non-stateless CNI, because there we want to start the telemetry service before any connection is attempted.
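
To make the trade-off concrete, here is a minimal sketch of keeping construction free of I/O and connecting in a separate step. It is an illustration under assumptions, not the PR's code: `Dialer`, `NewClient`, and `Connect` are hypothetical names, and the dial function is injected so that nothing here touches a real telemetry service.

```go
package main

import (
	"errors"
	"fmt"
)

// Dialer abstracts the connection to the node's telemetry service.
// It is a hypothetical type used only for this sketch.
type Dialer func() (send func(string) error, err error)

// Client is created without any I/O; Connect is called later by main.
type Client struct {
	send func(string) error // nil until Connect succeeds
}

// NewClient does no network work, so tests never reach the service.
func NewClient() *Client { return &Client{} }

// Connect dials the service. Each entry point decides when to call it,
// for example only after it has started the service.
func (c *Client) Connect(dial Dialer) error {
	send, err := dial()
	if err != nil {
		return fmt.Errorf("connect telemetry: %w", err)
	}
	c.send = send
	return nil
}

// SendEvent is a no-op while the client is not connected.
func (c *Client) SendEvent(msg string) error {
	if c.send == nil {
		return nil
	}
	return c.send(msg)
}

func main() {
	c := NewClient()
	_ = c.SendEvent("dropped: not connected yet")

	var got []string
	err := c.Connect(func() (func(string) error, error) {
		return func(m string) error { got = append(got, m); return nil }, nil
	})
	if err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	_ = c.SendEvent("ADD command completed")
	fmt.Println(got)

	// A failed dial leaves the client disconnected but still usable.
	c2 := NewClient()
	err = c2.Connect(func() (func(string) error, error) {
		return nil, errors.New("service down")
	})
	fmt.Println(err)
}
```

The cost of this shape is the nil check in `SendEvent`, which is the concern the reviewer raised; the benefit is that each main method controls when, and whether, a connection is attempted.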

Contributor Author

QxBytes commented Nov 15, 2024

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

we will split this part of the pr into its own pr
a telemetry event was added back which was previously removed
undo this pr to add those telemetry statements back
cni/network/plugin/main.go (outdated, resolved)
cni/network/stateless/main.go (outdated, resolved)
remove reflect
remove duplicated telemetry and telemetry buffer
remove unused fields in report manager
force access to telemetry client fields through methods
move telemetry start/connect code closer to start of plugin execution
we use SendError where we would have previously called reportPluginError (no log emitted)
we don't set error message in cni report because the error message and event message fields both end up in the Message field in the cni telemetry service
@timraymond
Member

LGTM on @ramiro-gamarra's approval

@@ -479,18 +436,12 @@ func (plugin *NetPlugin) Add(args *cniSkel.CmdArgs) error {
zap.String("pod", k8sPodName),
zap.Any("IPs", cniResult.IPs),
zap.Error(log.NewErrorWithoutStackTrace(err)))

telemetry.SendEvent(fmt.Sprintf("ADD command completed with [ipamAddResult]: %s [epInfos]: %s [error]: %v ", ipamAddResult.PrettyString(), network.FormatStructPointers(epInfos), err))
Contributor

Per one of your previous comments, telemetry.SendEvent will both write to azure-vnet.log as well as send to telemetry service. Aren't we then writing to the log file twice here if logger also writes to the file? Can we keep this PR to only a pure refactor without introducing new side effects?

Contributor Author

Yes, you are correct. I've removed the duplicate logger statement in the telemetry client.
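
A minimal sketch of the resulting behavior, assuming the client now forwards events to telemetry only and leaves file logging to the caller's logger. The `sink` type and the `Client` shape are hypothetical and stand in for azure-vnet.log and the node telemetry service.

```go
package main

import "fmt"

// sink records what was written to it; it stands in for either
// azure-vnet.log or the node telemetry service in this sketch.
type sink struct{ lines []string }

func (s *sink) write(msg string) { s.lines = append(s.lines, msg) }

// Client forwards events to telemetry only. The names here are
// hypothetical and are not the PR's actual types.
type Client struct{ telemetry *sink }

// SendEvent does not write to the log file; the caller's logger
// already does that.
func (c *Client) SendEvent(msg string) { c.telemetry.write(msg) }

func main() {
	logFile := &sink{}
	tel := &sink{}
	client := &Client{telemetry: tel}

	// The caller logs once through its logger, then emits one event.
	logFile.write("ADD command completed")
	client.SendEvent("ADD command completed")

	// Each destination receives the message exactly once.
	fmt.Println(len(logFile.lines), len(tel.lines))
}
```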


defer func() {
logger.Info("DEL command completed",
zap.String("pod", k8sPodName),
zap.Error(log.NewErrorWithoutStackTrace(err)))
telemetry.SendEvent(fmt.Sprintf("DEL command completed: [released ip]: %+v [podname]: %s [namespace]: %s [error]: %v", nwCfg.IPAM.Address, k8sPodName, k8sNamespace, err))
Contributor

Same as above. If the logger already writes to the file, then we are just duplicating data, as well as moving away from structured logs.

Contributor Author

@QxBytes QxBytes Nov 19, 2024

Addressed above. I was trying to replicate what I was seeing in the cnslogger, but since we already log beforehand, this isn't helpful.

Contributor

@ramiro-gamarra ramiro-gamarra left a comment

I may still be missing some details about the purpose of this refactor, but it seems to me that logs are getting duplicated and that the abstractions introduced are not cleaning up the code much yet.

telemetryWaitTimeInMilliseconds = 200
)

type TelemetryClient struct {
Contributor

There's a lot of stutter in the naming. Can't there just be a telemetry package with a Client type?

Contributor Author

moved the telemetry client to the existing telemetry package

lock sync.Mutex
}

type TelemetryInterface interface {
Contributor

There's only a single implementation of this interface. This is not needed.

cniReportSettings *telemetry.CNIReport
tb *telemetry.TelemetryBuffer
logger *zap.Logger
lock sync.Mutex
Contributor

is CNI spinning goroutines? What is this lock serializing?

Contributor Author

Currently we don't send telemetry from multiple goroutines, as far as I can tell. But if we want to send telemetry from anywhere in the codebase in the future, I would prefer to serialize sends now and not worry about races later. I can remove the lock for now if you would like.
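
As a rough illustration of what the lock would protect if telemetry were ever sent from several goroutines, consider the sketch below. The `Client` type, its `sent` slice, and `sendConcurrently` are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// Client guards its buffer with a mutex so that concurrent senders
// cannot corrupt it. This is a hypothetical type for the sketch.
type Client struct {
	lock sync.Mutex
	sent []string
}

// SendEvent appends under the lock, serializing concurrent callers.
func (c *Client) SendEvent(msg string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.sent = append(c.sent, msg)
}

// Count returns the number of events sent so far.
func (c *Client) Count() int {
	c.lock.Lock()
	defer c.lock.Unlock()
	return len(c.sent)
}

// sendConcurrently emits n events from n goroutines and waits for them.
func sendConcurrently(c *Client, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			c.SendEvent(fmt.Sprintf("event %d", id))
		}(i)
	}
	wg.Wait()
}

func main() {
	c := &Client{}
	sendConcurrently(c, 50)
	fmt.Println(c.Count())
}
```

Without the mutex, the concurrent `append` calls would be a data race and events could be lost. With a single-goroutine plugin the lock is uncontended, so its cost is small either way.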

Labels: ci (Infra or tooling.), cni (Related to CNI.), logging, telemetry
3 participants