
[Discussion] Adding support/helpers/processors for XML in libbeat #23366

Closed
P1llus opened this issue Jan 5, 2021 · 9 comments · Fixed by #23678
Labels
discuss Issue needs further discussion. libbeat Team:Integrations Label for the Integrations team

Comments

@P1llus
Member

P1llus commented Jan 5, 2021

This issue is to discuss potential implementations of XML support in Beats. Looking through the open issues, there are plenty of places where some sort of XML support would be beneficial.

However, each approach has pros and cons, which is why I wanted to open this discussion and get people's viewpoints.

Regarding XML in general: unlike JSON, the XML encoder in Go's standard library does not support unmarshalling into an interface as a built-in feature. However, there are libraries out there that take care of a lot of that, also in terms of performance. The discussion does not really need to focus on tooling, though, as the scope is more important at this stage.
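To illustrate the asymmetry mentioned above, here is a minimal stdlib-only sketch (the helper names are just for illustration): encoding/json can unmarshal into an untyped interface{}, while encoding/xml expects a concrete target such as a tagged struct.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
)

// decodeJSONAny shows that encoding/json supports untyped unmarshalling.
func decodeJSONAny(data []byte) (interface{}, error) {
	var v interface{}
	err := json.Unmarshal(data, &v)
	return v, err
}

// xmlDoc is the concrete target encoding/xml needs; there is no built-in
// way to decode arbitrary XML into an interface{}.
type xmlDoc struct {
	ID int `xml:"id"`
}

func decodeXMLDoc(data []byte) (xmlDoc, error) {
	var d xmlDoc
	err := xml.Unmarshal(data, &d)
	return d, err
}

func main() {
	j, _ := decodeJSONAny([]byte(`{"id": 1}`))
	fmt.Printf("json: %#v\n", j)

	d, _ := decodeXMLDoc([]byte(`<doc><id>1</id></doc>`))
	fmt.Println("xml id:", d.ID)
}
```

This is the gap that third-party libraries such as mxj fill by decoding arbitrary XML into a generic map.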

As far as I see it, there are a few places where we can add this:

1. Adding it as a new helper in libbeat common, similar to jsontransform and plenty of others.
Pros:
This is handy because it allows input developers to use the helper instead of either rewriting XML handling each time or implementing diverging functionality.
Compared to a processor, handling the XML in the input, before the queue, is beneficial in many ways. For example, processors do not support splitting lists, which is a very common use case when working on similar JSON structures. Other use cases would be using the keys or values for any sort of conditional tagging, parsing, or other transformations needed during ingest.

Cons:
Each input would need to add support for this manually.

2. Adding it as a new processor in libbeat that any beat can use.
Pros:
Anyone can use it, just as with any other processor, which makes it easy to cover a much larger scope.

Cons:
The inverse of option 1's pros: it does not make it possible to split or reshape the data before it enters the queue.

3. Adding an XML processor for ingest pipelines.
Pros:
Anyone can use it, including outside of Beats, similar to how the current Logstash XML filter works.

Cons:
Ingest pipelines currently do not support splitting, and the overhead created by XML is large; transforming it to JSON on the beat before sending would reduce that overhead significantly.

My own opinion is that all three are viable and useful and could be implemented, but I would rank them in the order above, especially since the helper created in libbeat could later be reused in the processor implementation as well.

Any thoughts, or thumbs up/down?

@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@elasticmachine
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@andrewstucki

So, I don't necessarily think that

Adding it as a new helper in libbeat common, similar to jsontransform and plenty of others.

necessitates anything different than this:

Adding it as a new processor in libbeat

Personally I would steer clear of adding custom formatting logic to an input; at that point it'd be (IMO) a leaky implementation. We have inputs to handle exactly that: doing whatever communication/network/file processing is required to actually grab messages from a source. All of the other transformation logic should be handled in a series of processors, whether they run on the edge or in an ingest node.

If this is implemented on the beats side of things, I'd be perfectly happy if we made a libbeat subpackage or common helper function to do the XML transformation and then, in practice, used it from a new processor, but I don't think that we should introduce a bunch of XML-specific logic to say the httpjson input, or create an httpxml input.

As far as the priorities are concerned, my thought is that this would be quicker to release and more under our control if we implemented it as an edge-processing (beats-side) thing, but that ultimately something like this ought to go into an ingest node processor, either instead of or in addition to the beats processor. I would think that most XML documents wouldn't have a ton of host-specific enrichments based on content in the actual payload, so most of the document transformation would likely be fairly standalone and doable in an ingest processor. With this in mind, I think we should do both a beats processor and an ingest processor: the beats processor first, and the ingest processor as a follow-up.

@P1llus
Member Author

P1llus commented Jan 5, 2021

@andrewstucki thanks for the feedback, lots of great points as well!

Would you mind elaborating a bit on the stance for custom logic on inputs?

As with the current httpjson and http_endpoint inputs, there are certain actions that have to happen in the input, the most important being split operations.
Since neither processors nor ingest pipelines can produce more than one document/event from a single input event, the only place splitting can happen is in the input.

The purpose is less to add heavy logic to inputs and more to be able to convert an XML string to a map; the rest of the input then functions as usual.

Another place where you might want a small amount of logic would be, for example, an HTTP response body.

If we look at the current modules using these inputs, a high number of them depend on being able to split.

The helper is then just there to handle the conversion for you, without impacting input logic in any way.
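As a concrete illustration of that split step, here is a minimal self-contained sketch (splitField and the field names are hypothetical, not from the POC): one decoded XML payload fans out into one map per list element, so each can become its own beat.Event.

```go
package main

import "fmt"

// splitField fans a decoded payload out into one map per element of a
// list field, copying the remaining top-level fields into each event.
func splitField(doc map[string]interface{}, field string) []map[string]interface{} {
	list, ok := doc[field].([]interface{})
	if !ok {
		// Nothing to split; the payload becomes a single event.
		return []map[string]interface{}{doc}
	}
	out := make([]map[string]interface{}, 0, len(list))
	for _, item := range list {
		event := map[string]interface{}{}
		for k, v := range doc {
			if k != field {
				event[k] = v // shared top-level fields
			}
		}
		event[field] = item
		out = append(out, event)
	}
	return out
}

func main() {
	doc := map[string]interface{}{
		"scan": "weekly",
		"host": []interface{}{
			map[string]interface{}{"ip": "10.0.0.1"},
			map[string]interface{}{"ip": "10.0.0.2"},
		},
	}
	for _, ev := range splitField(doc, "host") {
		fmt.Println(ev)
	}
}
```

A processor receives one event and returns at most one event, which is why this fan-out has to happen before or at the input stage.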

@andrewstucki

@P1llus: I missed the point that "splitting" meant chunking a single payload into multiple documents. If that's the desire, yeah, I can't think of a way of doing it with a processor. In that case, if we want to address a use case that requires chunking a single XML payload into multiple documents, you're right that we should consider doing the lightweight processing in the input itself. It's not ideal, but I can't think of another way of doing it.

@jsoriano added the discuss (Issue needs further discussion.) label on Jan 6, 2021
@jsoriano
Member

jsoriano commented Jan 6, 2021

@P1llus are you thinking of a specific use case for this?

I think this would be easier to address with specific use cases we want to cover. Full generic XML support may lead to a complex feature that is difficult to develop, maintain, and use.

For example, supporting log collection from files with newline-delimited XML objects has different requirements than periodically querying an endpoint whose response is in XML format and extracting some fields from it. For the same reasons, we have different features for JSON: at a minimum, the options in the log input, the decode_json_fields processor, the httpjson input, and the json metricset in the http module of Metricbeat.

If you have a use case in mind, you could start by implementing specifically what you need to support it. Is it enough with a processor and current inputs? Would you actually need to split events (and then do it at the input level)? Do you need to query periodically and extract metrics? While developing support for this use case, you will see whether there are parts not provided by external libraries that are generic enough to be moved to libbeat, especially if you have a second use case in mind.

Also, having a specific use case (or not having one 🙂) will help decide priorities.

@P1llus
Member Author

P1llus commented Jan 6, 2021

@jsoriano Thanks for taking the time! In regards to specific use cases, the way I approached it was to set some requirements on what scope of functionality it should support, and I have tested it with a few different sources.

Use cases:
First of all, while XML might not be as common anymore, there are a few areas in which it is still highly prominent, though most of these are related to security.

Scanners like Qualys and Nessus in most cases support either XML-only or XML+JSON APIs, and results exported from vulnerability management scanners are also usually XML.

There have been questions for quite some time about being able to ingest XML files, or at least offering the possibility to parse XML to some extent; Windows events are also often exported to XML when not using evtx.
SIEM sources also often support, or primarily use, XML in their communications.

Scope:
I have already developed a fully working POC of a libbeat helper, a processor, and an example implementation in http_endpoint to give some examples of scope. The code is still WIP and requires cleanup.

The libbeat helper has only one purpose (I might add a few extra helpers for QOL): it takes a []byte of the full XML to be parsed, unmarshals it, and returns it as a map[string]interface{}. This lets any beat or input easily handle XML in the future. It's not supposed to do much more, though it could be useful to also accept, for example, the path of the data you want to unmarshal, so you can discard the rest.

package common

import (
	"github.com/clbanning/mxj/v2"
)

// UnmarshalXML takes a slice of bytes, and returns a map[string]interface{}
func UnmarshalXML(body []byte) (obj map[string]interface{}, err error) {
	var xmlobj mxj.Map
	// Disable attribute prefixes and force all keys to lowercase to meet ECS standards.
	// Note: these two mxj settings are package-global flags.
	mxj.PrependAttrWithHyphen(false)
	mxj.CoerceKeysToLower(true)

	xmlobj, err = mxj.NewMapXml(body)
	if err != nil {
		return nil, err
	}

	err = xmlobj.Struct(&obj)
	if err != nil {
		return nil, err
	}
	return obj, nil
}

This allows an input to convert XML data directly into something you can place in a beat.Event or a common.MapStr.

The second part is reusing this helper in multiple locations, for example modifying http_endpoint to support XML:

func httpReadObject(body io.Reader) (obj common.MapStr, status int, err error) {
	if body == http.NoBody {
		return nil, http.StatusNotAcceptable, errBodyEmpty
	}

	contents, err := ioutil.ReadAll(body)
	if err != nil {
		return nil, http.StatusInternalServerError, fmt.Errorf("failed reading body: %w", err)
	}

	isObject, objType := isObject(contents)
	if !isObject {
		return nil, http.StatusBadRequest, errUnsupportedType
	}

	if objType == "json" {
		if err := json.Unmarshal(contents, &obj); err != nil {
			return nil, http.StatusBadRequest, fmt.Errorf("Malformed JSON body: %w", err)
		}
	} else if objType == "xml" {
		obj, err = common.UnmarshalXML(contents)
		if err != nil {
			return nil, http.StatusBadRequest, fmt.Errorf("Malformed XML body: %w", err)
		}
	} else {
		return nil, http.StatusInternalServerError, errUnknownType
	}

	return obj, 0, nil

}

And the third place where the helper can be used is in an xmldecode or decode_xml_fields processor, something like this, using decode_json_fields as a template:

func (x *xmlDecode) Run(event *beat.Event) (*beat.Event, error) {
	var errs []string

	for _, field := range x.config.Fields {
		data, err := event.GetValue(field)
		if err != nil && errors.Cause(err) != common.ErrKeyNotFound {
			x.logger.Debugf("Error trying to GetValue for field : %s in event : %v", field, event)
			errs = append(errs, err.Error())
			continue
		}

		xmloutput, err := x.decodeField(field, data)
		if err != nil {
			// Record the error and skip this field rather than writing a nil result into the event.
			x.logger.Errorf("failed to decode fields in xmldecode processor: %v", err)
			errs = append(errs, err.Error())
			continue
		}

		var id string
		if key := x.config.DocumentID; key != "" {
			if tmp, err := common.MapStr(xmloutput).GetValue(key); err == nil {
				if v, ok := tmp.(string); ok {
					id = v
					common.MapStr(xmloutput).Delete(key)
				}
			}
		}

		if field != "" {
			_, err = event.PutValue(field, xmloutput)
		} else {
			jsontransform.WriteJSONKeys(event, xmloutput, x.config.ExpandKeys, x.config.OverwriteKeys, x.config.AddErrorKey)
		}

		if err != nil {
			x.logger.Debugf("Error trying to Put value %v for field : %s", xmloutput, field)
			errs = append(errs, err.Error())
			continue
		}
		if id != "" {
			if event.Meta == nil {
				event.Meta = common.MapStr{}
			}
			event.Meta[events.FieldMetaID] = id
		}
	}

	if len(errs) > 0 {
		return event, fmt.Errorf(strings.Join(errs, ", "))
	}
	return event, nil
}

func (x *xmlDecode) decodeField(field string, data interface{}) (decodedData map[string]interface{}, err error) {
	str := fmt.Sprintf("%v", data)
	decodedData, err = common.UnmarshalXML([]byte(str))
	if err != nil {
		return nil, fmt.Errorf("error trying to decode XML field %v", err)
	}

	return decodedData, nil
}

Conclusion:
It's not there to support every niche use case, advanced attributes, or conversion of custom or complex datatypes; it is simply there to convert a complete XML string/document into something we can directly insert into a beats event.

Example XML to JSON document:
testxml.xml:

<HOST_LIST_VM_DETECTION_OUTPUT>
  <ID>
    6506432
  </ID>
</HOST_LIST_VM_DETECTION_OUTPUT>

Currently, the POC code results in:

"message": {
    "host_list_vm_detection_output": {
      "id": "6506432"
    }
  },

@jsoriano
Member

jsoriano commented Jan 6, 2021

@P1llus sounds good. If you have specific use cases, I think you could start by creating draft PRs for some of them, and the specifics could be discussed there. We are definitely going to need something like this if we want to support those scanners that only report using XML.
Regarding the helper, I'm not sure it is going to be common enough; for example, I guess attributes will mean different things in different services.

@P1llus
Member Author

P1llus commented Jan 6, 2021

@jsoriano That I can do!

The helper is meant to handle any XML transformation in general, so we don't need to redo that implementation for each input or processor that might want to use it. If you feel it makes things bloated I can always remove it, but that would require us to implement the same logic in multiple places instead.

Attributes mean the same thing in all use cases; they are just text attributes attached to an XML tag.
Looking in the common folder in libbeat, we have several of these small helper functions, so I thought it might be good to do something similar here.
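As a side note for readers less familiar with XML, here is a small stdlib-only illustration of what an attribute carries (the host type is purely illustrative): with encoding/xml an attribute is bound via a struct tag, whereas the mxj-based helper above surfaces attributes as ordinary map keys once the hyphen prefix is disabled.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// host shows how encoding/xml distinguishes an attribute from element text:
// in <host id="1">web</host>, "id" is an attribute and "web" is character data.
type host struct {
	ID   string `xml:"id,attr"`
	Name string `xml:",chardata"`
}

func decodeHost(data []byte) (host, error) {
	var h host
	err := xml.Unmarshal(data, &h)
	return h, err
}

func main() {
	h, _ := decodeHost([]byte(`<host id="1">web</host>`))
	fmt.Printf("id=%s name=%s\n", h.ID, h.Name)
}
```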

I would need some direction on how we want to approach the helper (or whether to scrap it), as it will impact any other PR I create, starting with, for example, the XML processor.
