-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement proposal for Network Observability #921
Enhancement proposal for Network Observability #921
Conversation
/uncc |
/retest |
/cc |
@stleerh: GitHub didn't allow me to request PR reviews from the following users: bbennett, amorenoz. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc knobunc |
information (e.g. pod, services, namespaces) and then saved in internal or | ||
external storage. | ||
|
||
The web console will provide a NetFlow table showing traffic between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to contribute this through an OpenShift console dynamic plugin or directly to the openshift/console repo?
Some background on dynamic plugins:
https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md
https://github.com/openshift/console/tree/master/frontend/packages/console-dynamic-plugin-sdk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OpenShift console dynamic plugin
|
||
#### OVS | ||
Open vSwitch (OVS) is the NetFlow exporter in the OpenShift cluster. When | ||
Network Observability is enabled, each OVS in the pod will be configured to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean:
Network Observability is enabled, each OVS in the pod will be configured to | |
Network Observability is enabled, each OVS in the cluster will be configured to |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, shouldn't it be node?
|
||
|
||
### Operators | ||
Two new operators will be added to OCP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small remark, the sentence "two new operators will be added" could be misunderstood, we're not going to develop the loki operator, just reuse the one developed and maintained by the cluster logging team, though we will expect the users to deploy a distinct instance.
So we really create 1 operator in terms of development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
d545ae5
to
953912e
Compare
Network Observability introduces a new category in OpenShift that provides networking information for a single cluster. It gives insight into what's on the network, when and what types of traffic and traffic flows are being made, and by whom. It gathers data to help design, plan, and answer questions about the network and provides a visual representation to help understand, diagnose, and troubleshoot networking issues.
Clarify that Loki Operator is a separate project. Also, squash commits into one for initial PR. Separate commits after initial PR.
953912e
to
1128b21
Compare
what's happening on their network. Monitoring provides metrics and | ||
alerts to potential problems. Network observability will then help you | ||
analyze, investigate and diagnose those problems by looking at it from | ||
a control plane perspective instead of device by device. In addition, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to make a rather generic observation here:
At lest to me, Network Observability can mean two things (maybe it means both at the same time):
1 - Observing the network traffic
2 - Observing the network logic
What we're doing here is adding Traffic Network Observability since we're capturing information of packets that actually go through the network. But, in order to understand why that traffic went through (or why not), we would have to visualize the network's logic. That means observing 4 layers of network logic (from more to less abstract):
- k8s/ ovn-k8s level logic (probably covered by other parts of Openshift Console): What networks/pods/services/policies, etc are configured and how they relate to each other
- OVN logic: how the k8s logic gets translated into a set of Logical Routers, Logical Switches, Load Balancers, ACLs, etc and how these objects are connected to each other and to the k8s nodes (Chassis).
- OVS logic: the Openflow flows that OVN configures into OVS and define what it should do with packets as they come in and out of certain openflow bridges [1] (which are a logical entity that groups openflow flows together).
- Datapath logic: the datapath flows configured by OVS into the OVS kernel datapath. This is the logic that actually get's applied to flows that come through the system.
So, whenever someone wants to understand why a packet is being dealt in a specific way, they'll probably have to navigate some of these network logic layers.
Now, I think it's OK to focus this enhancement on network traffic observability but I'd say we should not forget about the rest. For instance, the process of determining what logic to program into the virtual network is commonly referred to as "control plane" so this sentence, in this particular context, seems a bit confusing to me.
[1] I'll refer to this logical bridge entity in another comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in line with what Russell says, we all agree that Network Observability is broader than just observing the network traffic and that this proposal only addresses one aspect of the whole.
Currently our focus is clearly on traffic observability but we don't want to loose track on the network logic (it's tracked in jira :) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is to focus on network traffic and then see how much we need to expand beyond this based on customer feedback.
The idea is that this enhancement looks at it from a cluster point-of-view instead of a device point-of-view. Perhaps the better term to use is "cluster" instead of "control plane".
|
||
### Implementation Details/Notes/Constraints | ||
|
||
Here are some of the limitations and constraints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the IPFIX exporter be configured in all OVS bridges [1]?
If the answer is no, I think there is a limitation of certain flows not being visible.
More generally, is the concept of "logical bridge" going to be abstracted away from the customer visualization? For instance, if you sample at two bridges, A and B, that are connected to each other you might detect a netflow flow on ingressA, that same flow on egressA, again on ingressB and finally egressB. The way these flows are de-duplicated can be considered an implementation detail but I'd like to understand how much "logical" visualization are we adding here.
[1] See other comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, at the moment only the int bridge is going to be used (ie. no change regarding the current implementation of enabling flow exports on int bridge from the ovn-k layer).
I agree it can be mentioned in the limitations, and potentially mentioned as an area of improvement in "future work"
### Visualization | ||
This is the proposal for the NetFlow table. | ||
|
||
<https://marvelapp.com/prototype/h4fei7h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the final set of columns or will more be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the current proposal and it intentionally kept the list small for simplicity. If there are other attributes that you feel are important, please suggest adding them to the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One field comes to mind: the Direction field which can have 2 values INGRESS/EGRESS.
Some flows are only visible on one of the directions, for instance if Pod A accesses Service S that has Pod B as an endpoint you would see:
Pod A -> Service S INGRESS
Pod A -> Pod B EGRESS
How would this be shown?
- If no Direction column is shown, it could lead to the (wrong) conclusion that Pod A is generating twice as much traffic.
- If we enforce one direction (i.e. INGRESS) and hide the other we would loose the perspective on the load balancing taking place.
Additionally, OVS provides the port name information. This might need translation (from port name to pod name) and could be considered redundant if we have the Pod Src/Dst Names derived from the IP addresses. But, looking at the amount of traffic that is sent to the geneve port might give you an idea of whether your application is well scaled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed this comment. We will add the Direction field. Regarding the port name, let's see if that will be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general comment, the first few sections cast more high-level aspirational vision about network obvservability and then later in the doc it starts getting into a specific proposal about network flow collection and visualization. I would try to focus the document more on what's proposed to be implemented and not try to capture the broad potential scope of network observability overall.
You could consider a section on "future work" where you allude to what might come next, or what might build on the work proposed in this enhancement.
approvers: | ||
- "@russellb" | ||
- "@mcurry-rh" | ||
- "@bbennett" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to update this to ben's github username @knobunc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in fab47eb
Network Observability introduces a new category in OpenShift that | ||
provides networking information for a single cluster. It gives insight | ||
into what's on the network, when and what types of traffic and traffic | ||
flows are being made, and by whom. It gathers data to help design, plan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nit, but how would you differentiate "traffic" and "traffic flows"? Should this just be "traffic flows" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Traffic = IANA service name
Traffic flow = the actual NetFlow data which includes the traffic
It could just be "traffic flow" but wanted to call out the traffic part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2 cents: if we stick with IPFIX terminology, that would be "flow template" and "flow record"
## Motivation | ||
|
||
With Kubernetes, a layer of abstraction is added making it difficult for | ||
Red Hat and customers who managed their networks to be able to fully see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Red Hat and customers who managed their networks to be able to fully see | |
Red Hat and customers who manage their networks to be able to fully see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in fab47eb
The goal of the first release is to lay the groundwork and foundation | ||
in place, while still being able to deliver some functionality, even | ||
if it is at a smaller scale. The target is to have a Dev Preview to | ||
generate interest in network observability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary, motivation, and goals are all pretty high level and aspirational. When we get down into the Proposal, we get down to the proposed initial scope. I think it would be helpful to clarify earlier in the doc that while Network Observability can be quite broad, you've explicitly chosen this subset to work on in this iteration. It could be worth explaining why network flow collection and visualization was chosen as the next area to implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in Proposal in fab47eb
by a user with an admin or cluster-admin role. This is done by installing | ||
the Network Observability Operator, which in turn, installs the dependency | ||
operators necessary for this feature. The user can do this using the | ||
web console or the CLI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you summarize the dependencies here, as well? I wondered what "dependency operators" were referred to here, and then found the details in the later design details section. I think a quick summary here would still be helpful of the major components you will depend on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be updated to say it will require installation of Loki Operator. Similar to Cluster Logging, the user will need to install each one separately.
#### Network Observability Operator (NetObserv) | ||
The Network Observability Operator will need to be installed from OperatorHub | ||
to enable this feature. This operator has a dependency on the Loki | ||
Operator. This is an OpenShift Console dynamic plugin that is responsible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a console dynamic plugin? Not the operators ... so not sure what this refers to exactly.
Do you mean to say that the operator includes a console dynamic plugin? I'd look into how dynamic plugins are delivered, as I think they may need to be included in the console and not delivered as part of the operator? I'm not positive on that though.
It would help for this section to more clearly explain what the new operator will do to justify it. So far, I can guess:
- ensure the loki operator is installed?
- ensure a loki instance is installed and configured properly?
- automatically turn on and configure flow export from OVS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will follow the guidelines at https://github.com/openshift/console/tree/master/frontend/packages/console-dynamic-plugin-sdk. I will add the link to this document and ultimately, another link for the Network Observability Operator.
***Side note:***<br> | ||
This is a move away from Elasticsearch due to licensing restrictions. | ||
Open Distro for Elasticsearch was considered, but Loki was favored due to other | ||
components that plan to ultimately use Loki. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who is going to own loki and the loki operator and ensure those get shipped and supported appropriately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cluster Logging team. Working on this...
|
||
Measuring ROI on network observability is difficult so it might be hard | ||
to justify the cost and resources to deploy this. It attempts to find | ||
and resolve issues that you might not know exists. The value you get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and resolve issues that you might not know exists. The value you get | |
and resolve issues that you might not know exist. The value you get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in fab47eb
to justify the cost and resources to deploy this. It attempts to find | ||
and resolve issues that you might not know exists. The value you get | ||
may not be obvious because it is hard to know how much you saved by | ||
preventing ransomware or by allocating the right amount of resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"preventing ransomware" was a bit out of left field for me, since security isn't really a focus throughout the enhancement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposed text:
The value you get may not be obvious because it is difficult to calculate how much you save by preventing something from happening such as a network failure.
|
||
## Alternatives | ||
|
||
Sticking with and enhancing traditional pinpoint tools. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are "traditional pinpoint tools" ?
You've also got some "alternatives considered" content scattered throughout that could all be moved here, like the mention of ElasticSearch, or the use of existing operators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposed text:
Provide scripts and standalone applications to enhance traditional pinpoint tools like pcap, traceroute, netstat, etc.
I will move the other alternatives here.
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: knobunc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I know you're already working on implementing it here, but is there any chance to consider Pixie as an alternative approach? |
Pixie uses eBPF and that is something we will be looking at in the future as an alternative data source. We still want to provide NetFlow as a choice since that is what a number of customers are familiar with and willing to enable. |
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale |
/remove-lifecycle stale |
/lgtm |
Network Observability introduces a new category in OpenShift that provides networking information for a single cluster. It gives insight into what's on the network, when and what types of traffic and traffic flows are being made, and by whom. It gathers data to help design, plan, and answer questions about the network and provides a visual representation to help understand, diagnose, and troubleshoot networking issues.