New component: IPFIX Lookup #28692

Open · 2 tasks
fizzers123 opened this issue Oct 30, 2023 · 14 comments
Labels: Sponsor Needed (New component seeking sponsor), Stale

Comments

@fizzers123

fizzers123 commented Oct 30, 2023

The purpose and use-cases of the new component

Allow traces to be enhanced with IPFIX information stored in an Elasticsearch cluster.

Very similar functionality was already suggested in February 2023 (#18270). We would be interested in contributing our code here.

Example configuration for the component

[Diagram: CorrelationUnitv3.drawio]

processors:
  groupbytrace:
    wait_duration: 100s
    num_traces: 1000
    num_workers: 2
  ipfix_lookup:
    elastic_search:
      connection: 
        addresses:
          - https://<elastic_ip>:30200/
        username: elastic
        password: <password_here>
        certificate_fingerprint: <cert_fingerprint_here>
    timing:
      lookup_window: 120
    # # OPTIONAL settings:
    # query_parameters:
    #   base_query:
    #     field_name: input.type
    #     field_value: netflow
    #   device_identifier: "fields.observer\\.ip.0"
    #   lookup_fields:
    #     source_ip: source.ip
    #     source_port: source.port
    #     destination_ip: destination.ip
    #     destination_port: destination.port
    # span_attribute_fields:
    #   - "@this"
    #   - "fields.event\\.duration.0"
    #   - "fields.observer\\.ip.0"
    #   - "fields.source\\.ip.0"
    #   - "fields.source\\.port.0"
    #   - "fields.destination\\.ip.0"
    #   - "fields.destination\\.port.0"
    #   - "fields.netflow\\.ip_next_hop_ipv4_address"
    # spans:
    #   span_fields:
    #     source_ips:
    #       - net.peer.ip
    #       - net.peer.name
    #       - src.ip
    #     source_ports:
    #       - net.peer.port
    #       - src.port
    #     destination_ip_and_port:
    #       - http.host
    #     destination_ips:
    #       - dst.ip
    #     destination_ports:
    #       - dst.port  
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [groupbytrace, ipfix_lookup]
      exporters: [otlp/jaeger, debug]
  telemetry:
    logs:
      level: debug

Telemetry data types supported

traces

Is this a vendor-specific component?

  • This is a vendor-specific component
  • If this is a vendor-specific component, I am proposing to contribute and support it as a representative of the vendor.

Code Owner(s)

No response

Sponsor (optional)

No response

Additional context

As part of our Bachelor's thesis at the Eastern Switzerland University of Applied Sciences, we have created a basic implementation of this functionality.

[Screenshot: final implementation]

(The network was intentionally slowed down for this screenshot)

ipfix_lookup processor

Inside the OpenTelemetry pipeline, a new processor called ipfix_lookup can be configured. Before the IPFIX lookup is performed, all the traces are grouped together and a delay is added by the groupbytrace processor. The groupbytrace processor groups all incoming spans by trace and waits for the wait_duration before forwarding them to the ipfix_lookup processor.

Inside the ipfix_lookup processor, each span of the trace is then checked to see whether the IP and port quartet can be extracted. When the values (source.ip, source.port, destination.ip, destination.port, observer.ip) are found, the corresponding flow is searched for in Elasticsearch. For the time frame of the search, two considerations must be made.

Firstly, there is an ingest delay in any large distributed search engine. Because of this, the spans need to be pre-processed by the groupbytrace processor. The delay can be defined in the processors.groupbytrace.wait_duration value. Afterwards, the search can be started. The time window to search can be configured via processors.ipfix_lookup.timing.lookup_window. To keep the processor simple, the lookup_window is added before the start timestamp and after the end timestamp. This maximizes the chance of finding the NetFlow/IPFIX records that led to, or were caused by, this span.
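
As a minimal sketch (assuming lookup_window is given in seconds, which the example configuration does not state explicitly), the two timing settings interact roughly like this:

processors:
  groupbytrace:
    # Hold spans back long enough to cover the Elasticsearch ingest delay,
    # so the matching flow records are already searchable when the lookup runs.
    wait_duration: 100s
  ipfix_lookup:
    timing:
      # Widen the search interval on both sides of the span:
      # flows are searched from (span start - lookup_window)
      # to (span end + lookup_window).
      lookup_window: 120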

summary span

A summary span was added to simplify the display of the spans in Jaeger; all NetFlow/IPFIX spans are placed under it. As depicted in the screenshot, the summary span is highlighted in yellow and contains the TCP/IP quartet in its name. Both request and response are grouped under the same summary span.

The summary span also simplifies the ipfix_lookup processor, as the work can be split into two separate actions. First, the trace is checked for the IP/port quartet and summary spans are created. In the second step, the processor iterates through the summary spans and performs the IPFIX lookups.

@djaglowski
Member

Would you mind making a case for this being a connector vs a processor or receiver?

The only reason I bring up receiver as an option is because this was proposed previously and no argument was made against it. (Perhaps there is an obvious one but I'm not familiar with the protocol.)

If not a receiver, why not a processor? If I'm understanding correctly, it would only support traces. Therefore it's not clear that it needs to be a connector.

@fizzers123
Author

Hi @djaglowski

The reason it would be impossible to implement this as a receiver is that the context propagation information cannot be extracted from the NetFlow logs. The NetFlow/IPFIX logs only provide information up to OSI Layer 4, while context propagation headers such as traceparent live at OSI Layer 7.

The reason a connector was chosen is that new spans are inserted into an existing trace. With a processor, such a modification would have required the workaround described in the Why use a Connector? guide (if I understood correctly).

Historically, some processors transmitted data by making use of a work-around that follows a bad practice where a processor directly exports data after processing.
https://opentelemetry.io/docs/collector/build-connector/#why-use-a-connector

@djaglowski
Member

The reason a connector was chosen is that new spans are inserted into an existing trace. With a processor, such a modification would have required the workaround described in the Why use a Connector? guide (if I understood correctly).

Historically, some processors transmitted data by making use of a work-around that follows a bad practice where a processor directly exports data after processing.

I think there are two possible concerns to parse through here.

The first, as you cited, I think is not the same problem which is described there. That pattern was problematic because it emitted data directly to exporters, which meant there was no further opportunity to process the data. In this case, it would be possible to inject the generated spans directly into the original data stream (or replace the original altogether) and then continue processing both from there e.g. receiver -> proc 1 -> ipfixlookup -> proc 2 -> proc 3 -> exporter.
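
As a configuration sketch of that ordering (proc1, proc2, and proc3 are placeholders for arbitrary processors):

service:
  pipelines:
    traces:
      receivers: [otlp]
      # the spans generated by ipfixlookup continue through proc2 and proc3
      # together with the original spans
      processors: [proc1, ipfixlookup, proc2, proc3]
      exporters: [otlp]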

That said, the second consideration here is whether or not it is actually appropriate to do either of the above (replace the original data, or mix the generated into the original). In most situations, I would lean towards keeping generated data stream separate from the original data stream. This gives the user full control over whether to keep the original stream, keep both separate, or mix the two.

However, in this case you mentioned that we'd be generating spans which are part of the same trace. This sounds a lot like the generated and original data meaningfully belong together, but again I'm not familiar enough with the protocol to determine this. I think it would be helpful if you could clarify the following:

  1. Is the original data intended to be replaced by the generated data? Or, is it at least sometimes useful to keep both?
  2. If the answer to 1 is no (keep both generated and original data), do you think users may want to process the generated and original streams differently? Or, do you think both streams will generally be processed the same way?
  3. If the answers to 1 and 2 are no (keep both, process the same), is there any specific reason why the generated and original streams are semantically different, such that users should keep them separate?

@fizzers123
Author

The generated data are new IPFIX spans, which are part of an existing trace of spans. No original data is modified. Only new spans are added.

  1. It only makes sense to keep both. The new IPFIX spans are of little value without the original trace.
  2. I can't think of a use case where splitting the streams makes sense. Therefore, I believe they will generally be processed the same way.
  3. What exactly do you mean by semantically different?
    Would you consider spans from a Java app versus spans from a Reverse Proxy semantically different?
    3.1 If yes, the IPFIX spans should be kept separate. The IPFIX spans are just another source of spans.
    3.2 If no, they can be handled together with all the other spans.

@djaglowski
Member

  1. It only makes sense to keep both. The new IPFIX spans are of little value without the original trace.
  2. I can't think of a use case where splitting the streams makes sense. Therefore, I believe they will generally be processed the same way.

Thanks, based on these, I think a processor is probably appropriate. The only case where it would not be in my opinion would be based on the third question.

  3. What exactly do you mean by semantically different?
    Would you consider spans from a Java app versus spans from a Reverse Proxy semantically different?
    3.1 If yes, the IPFIX spans should be kept separate. The IPFIX spans are just another source of spans.
    3.2 If no, they can be handled together with all the other spans.

I didn't explain this well, but basically I'm asking if there's some other reason not to add the generated data directly into the original data stream. It sounds like there isn't a problem, so I would still say that a processor is appropriate here.

@ubaumann

Maybe to explain the use case (as far as I understand ;))

IPFIX or NetFlow are telemetry data about network packet flows. So, in this case, the ELK Stack contains all the metadata from the packets sent through the network. This provides a lot of observability information. With the right queries, you can see the path a single network packet took.

This project now aims to correlate an application trace with the exact network information. If I am looking in Jaeger at an API call, I usually see all the telemetry data from the application SDK. With this approach, the application trace gets enriched with the network information for that particular packet. I would see how long the API function runs; the function would make a DB or backend API call, and I would not only see how long it takes to get to the DB/backend, I would also see the exact path my request took over the network. This could show, for example, that we have performance issues when the path goes through the second load balancer, or that some other network connection is causing issues.
The goal is to add only the application-generated traffic from the network telemetry pool to OpenTelemetry.

As a user, I am really interested in seeing this come true. There would finally be far fewer cases of the application folks blaming the network :D
What makes this approach unique is extracting the network metadata from the request sent (source and destination IP, source and destination port), looking up the exact path, and adding it to the application trace.

What would be the problem/impact of implementing this as a processor versus a connector?

@djaglowski
Member

What would be the problem/impact if something is implemented as a processor or connector?

As far as I can tell, there wouldn't really be a difference for solving the use case, which is why I suggest a processor instead of a connector. A processor is easier to implement and, more importantly, easier to configure, because you don't have to worry about hooking pipelines up to one another. Connectors are great for certain things, but unless I'm missing something, I think one is unnecessary here and would therefore be unnecessarily complicated.
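
To illustrate the difference (the pipeline names and connector wiring below are only a sketch, not the proposed configuration): as a processor the lookup sits inline in a single traces pipeline, while a connector has to act as the exporter of one pipeline and the receiver of another.

# processor variant: one pipeline, no extra wiring
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [groupbytrace, ipfix_lookup]
      exporters: [otlp/jaeger]

# connector variant: two pipelines joined by the connector
service:
  pipelines:
    traces/in:
      receivers: [otlp]
      processors: [groupbytrace]
      exporters: [ipfix_lookup]   # connector acting as exporter
    traces/out:
      receivers: [ipfix_lookup]   # connector acting as receiver
      exporters: [otlp/jaeger]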

@fizzers123 changed the title from New component: Ipfixlookupconnector to New component: IPFIX Lookup on Nov 20, 2023
@SuniAve

SuniAve commented Nov 20, 2023

Hi @djaglowski

I work together with @fizzers123 on this project.
Thanks for your input. We agree that a processor would be the right component for our goal. We have migrated our code from a connector to a processor and are now working on further improvements.

This is an updated version of the illustration:
[Diagram: CorrelationUnitv2.drawio]

@fizzers123
Author

We have updated the implementation quite a bit and published our code here: https://github.com/fizzers123/opentelemetry-collector-contrib/tree/ipfix-processor-implementation/processor/ipfixlookupprocessor.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
