
Address review comments
stleerh authored Oct 12, 2021
1 parent 1128b21 commit fab47eb
118 changes: 58 additions & 60 deletions enhancements/network-observability.md
authors:
reviewers:
- "@russellb"
- "@mcurry-rh"
- "@bbennett"
- "@knobunc"
- "@amorenoz"
- "@eraichst"
- "@eparis"
- "@spadgett"
approvers:
- "@russellb"
- "@mcurry-rh"
- "@bbennett"
- "@knobunc"
creation-date: 2021-09-22
last-updated: 2021-10-11
status: implementable
see-also:
replaces:
## Motivation

With Kubernetes, a layer of abstraction is added, making it difficult for
Red Hat and customers who manage their networks to fully see
what's happening on their network. Monitoring provides metrics and
alerts for potential problems. Network observability then helps you
analyze, investigate, and diagnose those problems by looking at the network from
a centralized perspective instead of device by device. In addition,
it can assist in the areas of network planning, network policy validation,
security, and others.


## Proposal

Network Observability covers a broad area. In the first release,
it will focus on obtaining and storing NetFlow data and providing
visualization for this. This is a logical first step in showing what is
happening on your network.

Network Observability will be an opt-in feature that needs to be enabled
by a user with an admin or cluster-admin role. This is done by installing
the Network Observability Operator and Loki Operator. The user can do this
using the web console or the CLI.
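
If the CLI is used, the installation would typically go through Operator
Lifecycle Manager. The manifest below is only a sketch; the package name,
channel, and catalog source are assumptions rather than final values.

```yaml
# Sketch only: package name, channel, and catalog source are assumptions.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: netobserv-operator            # assumed package name
  namespace: openshift-operators
spec:
  channel: alpha                      # assumed channel
  name: netobserv-operator            # assumed package name
  source: redhat-operators            # assumed catalog source
  sourceNamespace: openshift-marketplace
```

Applying this with `oc apply -f` lets OLM resolve and install the operator;
installing from OperatorHub in the web console achieves the same result.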

Open vSwitch (OVS) will be configured to export IPFIX data. The data will be
collected and combined with Kubernetes-related information (e.g. pod, services,
namespaces) and then saved in local persistent storage or cloud storage such
as Amazon S3.
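
As a sketch of how the export side could be wired, OVN-Kubernetes already
exposes flow export through the Cluster Network Operator's
`exportNetworkFlows` field; whether this operator drives that field or
configures OVS directly is an implementation detail, and the collector
address below is a placeholder.

```yaml
# Sketch: OVS/OVN-Kubernetes IPFIX export pointed at a flow collector.
# The collector address is a placeholder for the flow collector service.
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  exportNetworkFlows:
    ipfix:
      collectors:
        - "192.0.2.10:2055"
```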

The web console will present traffic between pods in a NetFlow table. In the
future, more visualization and functionality
will be added to include areas such as topology, network data gathered using
eBPF, policy validation, security risks, and more.

***Note:*** *The term "NetFlow" is used generically throughout this document and
is synonymous with IPFIX, which is the IETF version of NetFlow.*

### User Stories

2. As a cluster/network admin, I need basic search and filtering tools to analyze a subset of NetFlow data inside the OpenShift console.
  - Users can filter NetFlow based on traffic source or destination (i.e., view all of the traffic associated with a particular node, service, namespace, pod, etc.); a hypothetical filter is sketched below.
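
To make that kind of filter concrete, a console request for "all traffic to
namespace web" might carry parameters along these lines; the field names are
hypothetical and not part of this proposal.

```yaml
# Hypothetical filter for the NetFlow table: traffic to namespace "web"
# over the last 15 minutes.
filters:
  dstNamespace: web
timeRange:
  last: 15m
limit: 100
```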

User story 1 will be partially addressed, as the feature provides the data
for debugging. However, the first release will not analyze why something
went wrong.

### Implementation Details/Notes/Constraints

Here are the limitations and constraints.

1. CNI must be OVN-Kubernetes<br>
The network type (CNI) has to be OVN-Kubernetes since configuring OVS to

Nevertheless, maintaining the same performance that you had prior to
enabling network observability will require additional resources.
The specific number of cores and the amount of memory and storage required
will be finalized as more testing is done.


### Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| Resource Risk<br> Not having enough resources (e.g. CPU) can be disruptive to the system, and care must be taken to not jeopardize the health of the network by enabling network observability. | Minimum resource requirements will be documented for CPU, memory, storage, and bandwidth. |
| Privacy<br> Operators need to be aware of what data is being collected as it might expose customers to privacy issues. In sensitive cases, storing IP addresses is problematic as it can reveal the user's general location and can be used to find out other information about the user, including possibly the user's identity. | Mask out sensitive data. This will prevent searching on this data. |
| Cost<br> It may not be clear how much additional cost the customer will incur, particularly if this is managed by Red Hat. Additional nodes may need to be added to avoid the resource risk described above. | Accounts managed by Red Hat can have an upper bound limit on how much can be spent on infrastructure. |
| Data accuracy<br> If data sampling is turned on, data accuracy cannot be relied on, so using it for audit reports or criminal investigations is not viable. | Turn off sampling. |
| Scalability<br> More than 20K NetFlows per second (the target for the first release) may need to be supported. | The sampling rate can be increased.<br>In addition, the plan is to add Apache Kafka to scale the flow collection and storage (sketched below). |
| Internal flows are not visible<br> IPFIX export is enabled only on the internal bridge at the OVN-Kubernetes layer, so some internal flows are not visible. | This is possible but will be deferred to a later release. |
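
To illustrate the Kafka mitigation in the scalability row, a topic sitting
between flow collection and storage could be declared as below; the use of
the Strimzi/AMQ Streams operator, the topic name, and the cluster name are
all assumptions.

```yaml
# Illustrative only: a Kafka topic that would buffer enriched flows between
# the collection pipeline and Loki ingestion.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: network-flows                    # assumed topic name
  labels:
    strimzi.io/cluster: netobserv-kafka  # assumed Kafka cluster name
spec:
  partitions: 24    # partitions let collectors and writers scale out
  replicas: 3
```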


## Design Details

#### Flow Enricher
The Flow Enricher watches for updates on pods, services, and daemonsets
using the Kubernetes API. It merges data from Flow Collector and
Kubernetes to be able to display pod-to-pod traffic. It is also
responsible for writing out the data to Storage. See Network Observability
Flow Enricher enhancement for more details (TBD).
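
To make the enrichment step concrete, the snippet below shows a hypothetical
flow record before and after Kubernetes metadata is merged in; all field
names are illustrative rather than a final schema.

```yaml
# Hypothetical IPFIX record as received from OVS (illustrative field names):
raw_flow:
  srcAddr: 10.128.2.14
  dstAddr: 10.129.0.7
  dstPort: 8080
  bytes: 52340
  packets: 61

# The same record after the Flow Enricher merges in Kubernetes metadata:
enriched_flow:
  srcAddr: 10.128.2.14
  srcPod: frontend-7d5f8c9b4-xkq2p
  srcNamespace: web
  dstAddr: 10.129.0.7
  dstPod: backend-6c4d7f5d9-m8zrt
  dstNamespace: web
  dstService: backend
  dstPort: 8080
  bytes: 52340
  packets: 61
```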

#### Storage
Storage stores the IPFIX/Kubernetes data. It provides a REST API to

Two operators are involved: the Network Observability Operator and the Loki
Operator.

#### Network Observability Operator (NetObserv)
The Network Observability Operator will need to be installed from OperatorHub
to enable this feature. This operator has a dependency on the Loki
Operator. The operator includes an [OpenShift Console dynamic plugin](https://github.com/openshift/console/tree/master/frontend/packages/console-dynamic-plugin-sdk)
that is responsible for defining resources and enabling collection of
NetFlow similar to Red Hat OpenShift Logging. This can be done from
Web Console or CLI.
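
For reference, console dynamic plugins are registered with the web console
through a `ConsolePlugin` resource; the sketch below uses assumed names for
the plugin and its backing service.

```yaml
# Sketch: registering the NetObserv console plugin. Names are assumptions.
apiVersion: console.openshift.io/v1alpha1
kind: ConsolePlugin
metadata:
  name: netobserv-plugin              # assumed plugin name
spec:
  displayName: Network Observability
  service:
    name: netobserv-plugin-service    # assumed service serving the plugin assets
    namespace: netobserv              # assumed namespace
    port: 9001
    basePath: /
```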

The operator will follow the [Operator Lifecycle Manager](https://olm.operatorframework.io/)
(OLM) model which is a component of the [Operator Framework](https://github.com/operator-framework).
This allows Network Observability to release on its own cycle, although the
plan is to follow OCP's release dates and leverage this flexibility
for hot fixes only.

#### Loki Operator
The Loki Operator is a separate project at
https://github.com/ViaQ/loki-operator but is required for network
observability. It deploys and manages Loki, which will be
used to store NetFlows. Loki will be installed in its own namespace with the
intention that if another component wants to use Loki, it should create its
own instance.
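
As a sketch of what an operator-managed instance could look like, the
upstream loki-operator exposes a `LokiStack` custom resource; the size,
storage class, namespace, and secret name below are placeholders and may
not match the final API.

```yaml
# Sketch only: field values are placeholders; the LokiStack API may evolve.
apiVersion: loki.grafana.com/v1beta1
kind: LokiStack
metadata:
  name: netobserv-loki              # assumed instance name
  namespace: netobserv              # assumed dedicated namespace
spec:
  size: 1x.small                    # capacity tier; actual sizing is TBD
  storageClassName: gp2             # placeholder storage class
  storage:
    secret:
      name: loki-object-storage     # secret holding S3 (or other) credentials
```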


### Visualization
This is the proposal for the NetFlow table.

A few resource-related parameters can be changed, but all values will
have some default so that enabling network observability will be as
simple as possible. For more details, see the enhancement on Network
Observability - Installation and Packaging (TBD).
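
Purely as an illustration of "a few resource-related parameters with
defaults", a tuning resource could look something like the following; the
API group, kind, and every field shown are placeholders that the
Installation and Packaging enhancement will actually define.

```yaml
# Placeholder sketch: the real resource name and fields are defined in the
# Installation and Packaging enhancement.
apiVersion: netobserv.example.com/v1alpha1   # placeholder API group/version
kind: FlowCollectorConfig                    # placeholder kind
metadata:
  name: cluster
spec:
  ipfix:
    sampling: 400          # 1:N packet sampling rate (assumed default)
  collector:
    replicas: 1
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
  retention: 24h           # how long to keep flow data (assumed default)
```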


### Test Plan
TBD

## Drawbacks

Measuring ROI on network observability is difficult so it might be hard
to justify the cost and resources to deploy this. It attempts to find
and resolve issues that you might not know exist. The value you get
may not be obvious because it is difficult to calculate how much you
save by preventing something from happening such as a network failure.

## Alternatives

Rather than having network observability, the troubleshooting aspect
can be done by creating scripts and standalone applications to enhance
traditional pinpoint tools like pcap, traceroute, netstat, etc. This
does not provide a centralized solution nor address the non-troubleshooting
issues, and it would be difficult to make each tool GUI-based and consistent.

Alternatives to introducing a new Network Observability Operator were
also considered. It didn't make sense to combine with the Cluster Network
Operator or the Cluster Monitoring Operator since network observability
is an opt-in feature that requires resources. Combining with the Red Hat
OpenShift Logging Operator (formerly Cluster Logging Operator) would limit
the ability to enable one and not the other.

For storage, Elasticsearch was not chosen due to licensing restrictions.
Open Distro for Elasticsearch was considered, but Loki was favored because
other components in the web console plan to ultimately use Loki.

## Infrastructure Needed

