[Discussion] Adding support/helpers/processors for XML in libbeat #23366
Pinging @elastic/integrations (Team:Integrations)
Pinging @elastic/security-external-integrations (Team:Security-External Integrations)
So, I don't necessarily think that necessitates anything different than this:

Personally I would stay clear of adding custom formatting logic to an input; at that point it'd be (IMO) a leaky implementation. We have inputs to handle literally that: doing whatever communication/network/file processing is required to actually grab messages from a source. All of the other transformation logic should be handled in a series of processors, whether they are on the edge or in an ingest node. If this is implemented on the beats side of things, I'd be perfectly happy if we made a libbeat subpackage or common helper function to do the XML transformation and then, in practice, used it from a new processor, but I don't think that we should introduce a bunch of XML-specific logic to, say, the httpjson input, or create an httpxml input.

As far as the priorities are concerned, my thought is that this would be quicker to release and more under our control if we implemented it as an edge-processing (beats-side) thing, but that ultimately something like this ought to go into an ingest node processor, either instead of or in addition to the beats processor. I would think that most XML documents wouldn't have a ton of host-specific enrichments based on content in the actual payload, so most of the document transformation would likely be fairly standalone and able to be done in an ingest processor. With this in mind, I kind of think we should do both a beats processor and an ingest processor, the beats processor first and the ingest processor as a follow-up.
@andrewstucki thanks for the feedback, lots of great points as well! Would you mind elaborating a bit on the stance against custom logic in inputs? As with the current httpjson and http_endpoint, there are certain actions that have to happen on input, the most important one being split operations. The purpose was less to add heavy logic to inputs and more to just be able to convert an XML string to a map; the rest of the input then functions as usual. Another place where you might want some small amount of logic would be, for example, an HTTP response body. If we look at the current modules using these inputs, a high number of them depend on being able to split. The helper is then just there to handle the conversion for you, without impacting logic in any way.
@P1llus: I missed the point that "splitting" meant chunking a single payload into multiple documents. If that's the desire, yeah, I can't think of a way of doing it with a processor. In that case I'd say that if we're wanting to address a use case that requires chunking a single XML payload into multiple documents, you're right that we should consider doing the lightweight processing in the input itself -- it's not ideal, but I can't think of another way of doing it.
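To make the chunking requirement above concrete, here is a minimal sketch of what splitting a decoded payload into multiple events could look like. The `splitEvents` helper is hypothetical (not existing beats code) and assumes the XML has already been converted to a generic map, for example by the proposed `UnmarshalXML` helper:

```go
package main

import "fmt"

// splitEvents fans the list stored under splitField out into one event per
// element, copying the remaining top-level keys into each resulting event.
// Hypothetical sketch, not actual beats code.
func splitEvents(doc map[string]interface{}, splitField string) []map[string]interface{} {
	list, ok := doc[splitField].([]interface{})
	if !ok {
		// Nothing to split; pass the document through as a single event.
		return []map[string]interface{}{doc}
	}
	events := make([]map[string]interface{}, 0, len(list))
	for _, item := range list {
		ev := make(map[string]interface{}, len(doc))
		for k, v := range doc {
			if k != splitField {
				ev[k] = v // shared metadata is duplicated into every event
			}
		}
		ev[splitField] = item
		events = append(events, ev)
	}
	return events
}

func main() {
	// e.g. a decoded scanner report with one entry per scanned host
	doc := map[string]interface{}{
		"scan_id": "42",
		"host": []interface{}{
			map[string]interface{}{"ip": "10.0.0.1"},
			map[string]interface{}{"ip": "10.0.0.2"},
		},
	}
	for _, ev := range splitEvents(doc, "host") {
		fmt.Println(ev["scan_id"], ev["host"])
	}
}
```

The point of the sketch is simply that this fan-out changes the number of events, which is why it has to run in the input (before the queue) rather than in a one-event-in, one-event-out processor.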
@P1llus are you thinking of a specific use case for this? I think this would be easier to address with specific use cases we want to cover. Having full generic XML support may lead to a complex feature that is difficult to develop, maintain and use. For example, supporting log collection from files with newline-delimited XML objects has different requirements than periodically querying an endpoint whose response is in XML format and extracting some fields from there. For the same reasons we have different features for JSON. We have at least the options in the

If you have some use case in mind, you could start by implementing specifically what you would need to support it. Is it enough with a processor and current inputs? Would you actually need to split events (and then do it at the input level)? Do you need to query periodically and extract metrics? While developing support for this use case you will see if there are some parts, not provided by external libraries, that are generic enough to be moved to libbeat, especially if you have a second use case in mind. Also, having a specific use case (or not having it 🙂) will help with deciding priorities.
@jsoriano Thanks for taking the time! In regards to specific use cases, the way I approached it was to set down some requirements on what scope of functionality it should support, and I have tested it with a few different sources.

Use cases: Scanners like Qualys and Nessus in most cases support either XML-only or XML+JSON APIs, and exported results from vulnerability management scanners are also usually XML. There have been questions for quite some time about being able to ingest XML files, or at least offering the possibility to parse XML to some extent; Windows events are also often exported to XML when not using evtx.

Scope: First of all, the libbeat helper serves only one purpose (I might add a few extra helpers for QOL), and the general idea is that it takes a slice of bytes and returns a map:

    package common

    import (
        "github.com/clbanning/mxj/v2"
    )

    // UnmarshalXML takes a slice of bytes and returns a map[string]interface{}.
    func UnmarshalXML(body []byte) (obj map[string]interface{}, err error) {
        var xmlobj mxj.Map
        // Disable attribute prefixes and force all keys to lowercase to meet ECS standards.
        mxj.PrependAttrWithHyphen(false)
        mxj.CoerceKeysToLower(true)
        xmlobj, err = mxj.NewMapXml(body)
        if err != nil {
            return nil, err
        }
        err = xmlobj.Struct(&obj)
        if err != nil {
            return nil, err
        }
        return obj, nil
    }

This allows an input to convert XML data directly into something you can place in a common.MapStr.

The second part is reusing this helper in multiple locations, for example modifying http_endpoint to support XML:

    func httpReadObject(body io.Reader) (obj common.MapStr, status int, err error) {
        if body == http.NoBody {
            return nil, http.StatusNotAcceptable, errBodyEmpty
        }
        contents, err := ioutil.ReadAll(body)
        if err != nil {
            return nil, http.StatusInternalServerError, fmt.Errorf("failed reading body: %w", err)
        }
        isObject, objType := isObject(contents)
        if !isObject {
            return nil, http.StatusBadRequest, errUnsupportedType
        }
        if objType == "json" {
            if err := json.Unmarshal(contents, &obj); err != nil {
                return nil, http.StatusBadRequest, fmt.Errorf("malformed JSON body: %w", err)
            }
        } else if objType == "xml" {
            obj, err = common.UnmarshalXML(contents)
            if err != nil {
                return nil, http.StatusBadRequest, fmt.Errorf("malformed XML body: %w", err)
            }
        } else {
            return nil, http.StatusInternalServerError, errUnknownType
        }
        return obj, 0, nil
    }

And then the third part where the helper can be used is in a processor:

    func (x *xmlDecode) Run(event *beat.Event) (*beat.Event, error) {
        var errs []string
        for _, field := range x.config.Fields {
            data, err := event.GetValue(field)
            if err != nil && errors.Cause(err) != common.ErrKeyNotFound {
                x.logger.Debugf("Error trying to GetValue for field : %s in event : %v", field, event)
                errs = append(errs, err.Error())
                continue
            }
            xmloutput, err := x.decodeField(field, data)
            if err != nil {
                x.logger.Errorf("failed to decode fields in xmldecode processor: %v", err)
            }
            var id string
            if key := x.config.DocumentID; key != "" {
                if tmp, err := common.MapStr(xmloutput).GetValue(key); err == nil {
                    if v, ok := tmp.(string); ok {
                        id = v
                        common.MapStr(xmloutput).Delete(key)
                    }
                }
            }
            if field != "" {
                _, err = event.PutValue(field, xmloutput)
            } else {
                jsontransform.WriteJSONKeys(event, xmloutput, x.config.ExpandKeys, x.config.OverwriteKeys, x.config.AddErrorKey)
            }
            if err != nil {
                x.logger.Debugf("Error trying to Put value %v for field : %s", xmloutput, field)
                errs = append(errs, err.Error())
                continue
            }
            if id != "" {
                if event.Meta == nil {
                    event.Meta = common.MapStr{}
                }
                event.Meta[events.FieldMetaID] = id
            }
        }
        if len(errs) > 0 {
            return event, fmt.Errorf(strings.Join(errs, ", "))
        }
        return event, nil
    }

    func (x *xmlDecode) decodeField(field string, data interface{}) (decodedData map[string]interface{}, err error) {
        str := fmt.Sprintf("%v", data)
        decodedData, err = common.UnmarshalXML([]byte(str))
        if err != nil {
            return nil, fmt.Errorf("error trying to decode XML field: %v", err)
        }
        return decodedData, nil
    }

Conclusion:

Example XML to JSON document:
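For illustration, an input document producing the result below could look roughly like this (reconstructed from the output for this example; the upper-case element names typical of Qualys reports are normalized by `CoerceKeysToLower`):

```xml
<HOST_LIST_VM_DETECTION_OUTPUT>
  <ID>6506432</ID>
</HOST_LIST_VM_DETECTION_OUTPUT>
```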
Currently, the POC code results in:

    "message": {
      "host_list_vm_detection_output": {
        "id": "6506432"
      }
    },
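A configuration for such a processor could look roughly like this. This is only a sketch: the option names mirror the POC's config struct (fields, document ID, key expansion/overwriting, error key), and the final naming and shape are up for discussion:

```yaml
processors:
  - decode_xml:
      # fields containing XML strings to decode in place
      fields: ["message"]
      # optional: promote this decoded key to the event's document ID
      document_id: "host_list_vm_detection_output.id"
      overwrite_keys: true
      add_error_key: true
```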
@P1llus sounds good. If you have specific use cases I think you could start by creating draft PRs for some of them, and the specifics can be discussed there. We are definitely going to need something like this if we want to support these scanners that only report using XML.
@jsoriano That I can do! The helper is meant to handle any XML transformation in general, so we don't need to redo that implementation for each input or processor that might want to use it. If you feel it makes things bloated I can always remove it, but that would still require us to implement the same logic in multiple places. The attributes themselves mean the same thing for all use cases; they're just text attributes attached to an XML tag. I would need some direction on how we want to approach the helper, though (or on scrapping it), as it will impact any other PR I create, starting with for example the XML processor?
This issue is to discuss potential implementations of XML support for beats. Looking through different open issues, there are plenty of places where some sort of XML support would be beneficial.
However, there are some pros and cons to all of them, which is why I wanted to have this open discussion and get people's viewpoints.
Regarding XML in general: unlike JSON, the XML encoder in golang does not support unmarshalling to an interface as a built-in feature. However, there are libraries out there that take care of a lot of that, also in terms of performance. The discussion does not really need to focus on tooling, though, as the scope is more important at this stage.
As far as I see it, there are a few places where we can add this:
1. Adding it as a new helper in libbeat common, similar to jsontransform and plenty of others.
Pros:
This is handy because it allows input developers to use the helper instead of having to rewrite XML handling each time or implement different types of functionality.
Compared to a processor, handling the XML on input, before the queue, is beneficial in many ways. For example, processors do not support splitting of lists, which is a very common use case when working with similar JSON structures; other use cases would be using the keys or values for any sort of conditional tagging, parsing or other transformations needed during ingest.
Cons:
Each input would need to manually add support for this.
2. Adding it as a new processor in libbeat, which allows any specific beat type to use it.
Pros:
Anyone can use it, just as with any other processor, which makes it easy to cover a much larger scope.
Cons:
The inverse of the pros of option 1: it does not make it possible to split or format the data beforehand.
3. Adding an XML processor for ingest pipelines.
Pros:
Anyone can use it, also outside of beats, similar to how the current Logstash XML filter functions.
Cons:
Currently ingest pipelines do not support splitting functionality, and the overhead created by XML is large; transforming it to JSON on the beat before sending would reduce the overhead significantly.
My own opinion on the subject is that all three are viable and useful and could be implemented, but I would rank them in the order above, especially since the helper created in libbeat could later be used in the processor implementation as well.
Any thoughts, or thumbs up/down?