Enhancement proposal for Network Observability #921

Merged


stleerh
Contributor

@stleerh stleerh commented Oct 5, 2021

Network Observability introduces a new category in OpenShift that provides networking information for a single cluster. It gives insight into what's on the network, when and what types of traffic and traffic flows are being made, and by whom. It gathers data to help design, plan, and answer questions about the network and provides a visual representation to help understand, diagnose, and troubleshoot networking issues.

@aravindhp
Contributor

/uncc

@openshift-ci openshift-ci bot removed the request for review from aravindhp October 5, 2021 17:11
@stleerh
Contributor Author

stleerh commented Oct 5, 2021

/retest

@jotak
Contributor

jotak commented Oct 6, 2021

/cc

@openshift-ci openshift-ci bot requested a review from jotak October 6, 2021 05:29
@stleerh
Contributor Author

stleerh commented Oct 6, 2021

@openshift-ci
Contributor

openshift-ci bot commented Oct 6, 2021

@stleerh: GitHub didn't allow me to request PR reviews from the following users: bbennett, amorenoz.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @russellb @mcurry-rh @bbennett @amorenoz @eraichst @eparis @spadgett


@stleerh
Contributor Author

stleerh commented Oct 6, 2021

/cc knobunc

@openshift-ci openshift-ci bot requested a review from knobunc October 6, 2021 15:05
information (e.g. pod, services, namespaces) and then saved in internal or
external storage.

The web console will provide a NetFlow table showing traffic between
Member

Do you plan to contribute this through an OpenShift console dynamic plugin or directly to the openshift/console repo?

Some background on dynamic plugins:
https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md
https://github.com/openshift/console/tree/master/frontend/packages/console-dynamic-plugin-sdk

Contributor Author

OpenShift console dynamic plugin


#### OVS
Open vSwitch (OVS) is the NetFlow exporter in the OpenShift cluster. When
Network Observability is enabled, each OVS in the pod will be configured to
Contributor

I think you mean:

Suggested change
Network Observability is enabled, each OVS in the pod will be configured to
Network Observability is enabled, each OVS in the cluster will be configured to

?

Contributor Author

Actually, shouldn't it be node?



### Operators
Two new operators will be added to OCP.
Contributor

Small remark: the sentence "two new operators will be added" could be misunderstood. We're not going to develop the Loki operator; we will just reuse the one developed and maintained by the cluster logging team, though we will expect users to deploy a distinct instance.
So we really create one operator in terms of development.

Contributor Author

Updated

stleerh and others added 3 commits October 7, 2021 12:57
Network Observability introduces a new category in OpenShift that
provides networking information for a single cluster. It gives insight
into what's on the network, when and what types of traffic and traffic
flows are being made, and by whom. It gathers data to help design, plan,
and answer questions about the network and provides a visual
representation to help understand, diagnose, and troubleshoot networking
issues.
Clarify that Loki Operator is a separate project.
Also, squash commits into one for initial PR.  Separate commits after
initial PR.
what's happening on their network. Monitoring provides metrics and
alerts to potential problems. Network observability will then help you
analyze, investigate and diagnose those problems by looking at it from
a control plane perspective instead of device by device. In addition,
@amorenoz amorenoz Oct 8, 2021

I'd like to make a rather generic observation here:

At least to me, Network Observability can mean two things (maybe it means both at the same time):
1 - Observing the network traffic
2 - Observing the network logic

What we're doing here is adding Traffic Network Observability since we're capturing information of packets that actually go through the network. But, in order to understand why that traffic went through (or why not), we would have to visualize the network's logic. That means observing 4 layers of network logic (from more to less abstract):

  • k8s/ovn-k8s level logic (probably covered by other parts of the OpenShift Console): what networks/pods/services/policies, etc. are configured and how they relate to each other.
  • OVN logic: how the k8s logic gets translated into a set of Logical Routers, Logical Switches, Load Balancers, ACLs, etc. and how these objects are connected to each other and to the k8s nodes (Chassis).
  • OVS logic: the OpenFlow flows that OVN configures into OVS and that define what it should do with packets as they come in and out of certain OpenFlow bridges [1] (which are a logical entity that groups OpenFlow flows together).
  • Datapath logic: the datapath flows configured by OVS into the OVS kernel datapath. This is the logic that actually gets applied to flows that come through the system.

So, whenever someone wants to understand why a packet is being handled in a specific way, they'll probably have to navigate some of these network logic layers.

Now, I think it's OK to focus this enhancement on network traffic observability but I'd say we should not forget about the rest. For instance, the process of determining what logic to program into the virtual network is commonly referred to as "control plane" so this sentence, in this particular context, seems a bit confusing to me.

[1] I'll refer to this logical bridge entity in another comment

Contributor

This is in line with what Russell says, we all agree that Network Observability is broader than just observing the network traffic and that this proposal only addresses one aspect of the whole.
Currently our focus is clearly on traffic observability, but we don't want to lose track of the network logic (it's tracked in jira :) )

Contributor

@stleerh we have this in JIRA, but indeed, this area could as well be mentioned in the "future work" section that Russell suggests

Contributor Author

The plan is to focus on network traffic and then see how much we need to expand beyond this based on customer feedback.

The idea is that this enhancement looks at it from a cluster point-of-view instead of a device point-of-view.  Perhaps the better term to use is "cluster" instead of "control plane".


### Implementation Details/Notes/Constraints

Here are some of the limitations and constraints.

Will the IPFIX exporter be configured in all OVS bridges [1]?
If the answer is no, I think there is a limitation of certain flows not being visible.

More generally, is the concept of "logical bridge" going to be abstracted away from the customer visualization? For instance, if you sample at two bridges, A and B, that are connected to each other, you might detect a NetFlow flow on ingressA, that same flow on egressA, again on ingressB and finally on egressB. The way these flows are de-duplicated can be considered an implementation detail, but I'd like to understand how much "logical" visualization we are adding here.

[1] See other comment above
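
To make the de-duplication question above concrete, here is a minimal Python sketch. It is not the proposal's implementation: the `FlowRecord` shape, its field names, and the `dedupe_flows` helper are all hypothetical, and it assumes that the same flow seen at several observation points shares one 5-tuple.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


# Hypothetical record shape; the field names are illustrative only and are
# not taken from the IPFIX information model or from the enhancement.
@dataclass(frozen=True)
class FlowRecord:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int            # IP protocol number, e.g. 6 for TCP
    observation_point: str   # e.g. "ingressA", "egressA", "ingressB", "egressB"
    byte_count: int


FlowKey = Tuple[str, str, int, int, int]


def flow_key(record: FlowRecord) -> FlowKey:
    """Return the 5-tuple that identifies a flow regardless of where it was seen."""
    return (record.src_ip, record.dst_ip, record.src_port,
            record.dst_port, record.protocol)


def dedupe_flows(records: Iterable[FlowRecord]) -> List[FlowRecord]:
    """Keep the first record seen for each 5-tuple and drop later sightings."""
    first_seen: Dict[FlowKey, FlowRecord] = {}
    for record in records:
        first_seen.setdefault(flow_key(record), record)
    return list(first_seen.values())


# The same flow observed at four points collapses to a single record.
sightings = [
    FlowRecord("10.0.0.1", "10.0.0.2", 40000, 80, 6, point, 1500)
    for point in ("ingressA", "egressA", "ingressB", "egressB")
]
unique = dedupe_flows(sightings)
```

A real collector would also have to bound this by a time window and decide which sighting to keep; the sketch only shows why sampling on more than one bridge needs some such step.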

Contributor

No, at the moment only the int bridge is going to be used (i.e., no change regarding the current implementation of enabling flow exports on the int bridge from the ovn-k layer).
I agree it can be mentioned in the limitations, and potentially mentioned as an area of improvement in "future work"

### Visualization
This is the proposal for the NetFlow table.

<https://marvelapp.com/prototype/h4fei7h>

Is this the final set of columns or will more be added?

Contributor Author

This is the current proposal, and the list is intentionally kept small for simplicity.  If there are other attributes that you feel are important, please suggest adding them to the list.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One field comes to mind: the Direction field which can have 2 values INGRESS/EGRESS.
Some flows are only visible on one of the directions, for instance if Pod A accesses Service S that has Pod B as an endpoint you would see:
Pod A -> Service S INGRESS
Pod A -> Pod B EGRESS

How would this be shown?

  • If no Direction column is shown, it could lead to the (wrong) conclusion that Pod A is generating twice as much traffic.
  • If we enforce one direction (i.e. INGRESS) and hide the other we would lose the perspective on the load balancing taking place.

Additionally, OVS provides the port name information. This might need translation (from port name to pod name) and could be considered redundant if we have the Pod Src/Dst Names derived from the IP addresses. But, looking at the amount of traffic that is sent to the geneve port might give you an idea of whether your application is well scaled.
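
The double-counting concern above can be illustrated with a small Python sketch. The `Flow` tuple, the `bytes_by_source` helper, and the byte counts are hypothetical and exist only for this example; they are not part of the proposed NetFlow table.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class Flow(NamedTuple):
    """Hypothetical flow row; the names and values are illustrative only."""
    src: str
    dst: str
    direction: str   # "INGRESS" or "EGRESS"
    byte_count: int


def bytes_by_source(flows: Iterable[Flow],
                    direction: Optional[str] = None) -> Dict[str, int]:
    """Sum bytes per source, optionally restricted to a single direction."""
    totals: Dict[str, int] = defaultdict(int)
    for flow in flows:
        if direction is None or flow.direction == direction:
            totals[flow.src] += flow.byte_count
    return dict(totals)


# Pod A talks to Service S, which load-balances to Pod B.
flows = [
    Flow("Pod A", "Service S", "INGRESS", 1000),
    Flow("Pod A", "Pod B", "EGRESS", 1000),
]

without_direction = bytes_by_source(flows)        # {'Pod A': 2000}
ingress_only = bytes_by_source(flows, "INGRESS")  # {'Pod A': 1000}
```

Ignoring the direction attributes 2000 bytes to Pod A, while filtering on one direction gives 1000 but drops the row that shows Pod B as the chosen endpoint, which is the trade-off described in the two bullets.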

Contributor Author

Sorry, I missed this comment.  We will add the Direction field.  Regarding the port name, let's see if that will be necessary.

Member

@russellb russellb left a comment

As a general comment, the first few sections cast a more high-level, aspirational vision of network observability, and then later in the doc it starts getting into a specific proposal about network flow collection and visualization. I would try to focus the document more on what's proposed to be implemented and not try to capture the broad potential scope of network observability overall.

You could consider a section on "future work" where you allude to what might come next, or what might build on the work proposed in this enhancement.

approvers:
- "@russellb"
- "@mcurry-rh"
- "@bbennett"
Member

need to update this to ben's github username @knobunc

Contributor Author

Updated in fab47eb

Network Observability introduces a new category in OpenShift that
provides networking information for a single cluster. It gives insight
into what's on the network, when and what types of traffic and traffic
flows are being made, and by whom. It gathers data to help design, plan,
Member

a nit, but how would you differentiate "traffic" and "traffic flows"? Should this just be "traffic flows" ?

Contributor Author

@stleerh stleerh Oct 11, 2021

Traffic = IANA service name
Traffic flow = the actual NetFlow data which includes the traffic

It could just be "traffic flow" but wanted to call out the traffic part.
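
As a rough illustration of the distinction, the Python sketch below derives a "traffic" label from a flow's protocol and destination port. The `IANA_SERVICES` table is a tiny hand-picked subset of the IANA service name registry, and `traffic_type` is a hypothetical helper, not something defined by the enhancement.

```python
from typing import Dict, Tuple

# Small hand-picked subset of the IANA service name registry, keyed by
# (IP protocol number, destination port). A real lookup would use the
# full registry, for example through socket.getservbyport.
IANA_SERVICES: Dict[Tuple[int, int], str] = {
    (6, 22): "ssh",
    (6, 80): "http",
    (6, 443): "https",
    (17, 53): "domain",
}


def traffic_type(protocol: int, dst_port: int) -> str:
    """Return the IANA service name for a flow, or "unknown" if not listed."""
    return IANA_SERVICES.get((protocol, dst_port), "unknown")


# The flow record carries the addresses, ports, and counters; the "traffic"
# is the service name derived from that record.
label = traffic_type(6, 443)
```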

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents: if we stick with IPFIX terminology, that would be "flow template" and "flow record"

## Motivation

With Kubernetes, a layer of abstraction is added making it difficult for
Red Hat and customers who managed their networks to be able to fully see
Member

Suggested change
Red Hat and customers who managed their networks to be able to fully see
Red Hat and customers who manage their networks to be able to fully see

Contributor Author

Updated in fab47eb

The goal of the first release is to lay the groundwork and foundation
in place, while still being able to deliver some functionality, even
if it is at a smaller scale. The target is to have a Dev Preview to
generate interest in network observability.
Member

Summary, motivation, and goals are all pretty high level and aspirational. When we get down into the Proposal, we get down to the proposed initial scope. I think it would be helpful to clarify earlier in the doc that while Network Observability can be quite broad, you've explicitly chosen this subset to work on in this iteration. It could be worth explaining why network flow collection and visualization was chosen as the next area to implement.

Contributor Author

Added in Proposal in fab47eb

by a user with an admin or cluster-admin role. This is done by installing
the Network Observability Operator, which in turn, installs the dependency
operators necessary for this feature. The user can do this using the
web console or the CLI.
Member

Can you summarize the dependencies here, as well? I wondered what "dependency operators" were referred to here, and then found the details in the later design details section. I think a quick summary here would still be helpful of the major components you will depend on.

Contributor Author

This will be updated to say it will require installation of Loki Operator.  Similar to Cluster Logging, the user will need to install each one separately.

#### Network Observability Operator (NetObserv)
The Network Observability Operator will need to be installed from OperatorHub
to enable this feature. This operator has a dependency on the Loki
Operator. This is an OpenShift Console dynamic plugin that is responsible
Member

What is a console dynamic plugin? Not the operators ... so not sure what this refers to exactly.

Do you mean to say that the operator includes a console dynamic plugin? I'd look into how dynamic plugins are delivered, as I think they may need to be included in the console and not delivered as part of the operator? I'm not positive on that though.

It would help for this section to more clearly explain what the new operator will do to justify it. So far, I can guess:

  • ensure the loki operator is installed?
  • ensure a loki instance is installed and configured properly?
  • automatically turn on and configure flow export from OVS?

Contributor Author

Yes, it will follow the guidelines at https://github.com/openshift/console/tree/master/frontend/packages/console-dynamic-plugin-sdk.  I will add the link to this document and ultimately, another link for the Network Observability Operator.

***Side note:***<br>
This is a move away from Elasticsearch due to licensing restrictions.
Open Distro for Elasticsearch was considered, but Loki was favored due to other
components that plan to ultimately use Loki.
Member

Who is going to own loki and the loki operator and ensure those get shipped and supported appropriately?

Contributor Author

Cluster Logging team.  Working on this...


Measuring ROI on network observability is difficult so it might be hard
to justify the cost and resources to deploy this. It attempts to find
and resolve issues that you might not know exists. The value you get
Member

Suggested change
and resolve issues that you might not know exists. The value you get
and resolve issues that you might not know exist. The value you get

Contributor Author

Updated in fab47eb

to justify the cost and resources to deploy this. It attempts to find
and resolve issues that you might not know exists. The value you get
may not be obvious because it is hard to know how much you saved by
preventing ransomware or by allocating the right amount of resources
Member

"preventing ransomware" was a bit out of left field for me, since security isn't really a focus throughout the enhancement

Contributor Author

@stleerh stleerh Oct 11, 2021

Proposed text:
The value you get may not be obvious because it is difficult to calculate how much you save by preventing something from happening such as a network failure.


## Alternatives

Sticking with and enhancing traditional pinpoint tools.
Member

What are "traditional pinpoint tools" ?

You've also got some "alternatives considered" content scattered throughout that could all be moved here, like the mention of ElasticSearch, or the use of existing operators.

Contributor Author

Proposed text:
Provide scripts and standalone applications to enhance traditional pinpoint tools like pcap, traceroute, netstat, etc.

I will move the other alternatives here.

@knobunc
Contributor

knobunc commented Oct 20, 2021

/approve

@openshift-ci
Contributor

openshift-ci bot commented Oct 20, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 20, 2021
@alibo

alibo commented Oct 22, 2021

I know you're already working on implementing it here, but is there any chance to consider Pixie as an alternative approach?

@stleerh
Contributor Author

stleerh commented Oct 25, 2021

I know you're already working on implementing it here, but is there any chance to consider Pixie as an alternative approach?

Pixie uses eBPF and that is something we will be looking at in the future as an alternative data source.  We still want to provide NetFlow as a choice since that is what a number of customers are familiar with and willing to enable.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2021
@stleerh
Contributor Author

stleerh commented Nov 24, 2021

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2021
@knobunc
Contributor

knobunc commented Dec 8, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021
@openshift-merge-robot openshift-merge-robot merged commit 8762e02 into openshift:master Dec 8, 2021
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
10 participants